ISSUES IN THE DESIGN AND ANALYSIS OF COMPUTER EXPERIMENTS
David M. Steinberg, Tel Aviv University
SAMSI Working Group, March 2007



THANK YOUS
• Noga Alon
• Ronit Steinberg

COLLABORATORS
• Dennis Lin
• Dizza Bursztyn
• Ron Kenett
• Henry Wynn
• Ron Bates
• Sigal Levy
• Einat Neuman Ben Ari
• Gideon Leonard
• Tamir Reisin
• Eyal Hashavia
• Zeev Somer


PREVIEW

1. Some Applications
   • Nuclear Waste Repository
   • Ground Response to an Earthquake
   • Chemotherapy Simulator
   • Optimizing a Piston
2. Designing Computer Experiments
3. Latin Hypercube Designs
4. Rotated Factorial Designs
5. LHD’s as Rotated Factorial Designs
6. Near LHD’s from Rotated Factorials
7. Nuclear Waste Disposal: Quandaries
8. Chemotherapy: Quandaries
9. Ground Shaking: Quandaries
10. GASP Models and Bayesian Regression


Example: Nuclear Waste Repository

RESRAD computes leaching of radioactive isotopes from the repository into the food and water supply.

Time frame is thousands of years, so field study is impossible.


Example: Nuclear Waste Repository

Inputs
• Initial isotope concentrations
• Distribution coefficients of the isotopes
• Lithology of the repository

Outputs
• Maximal dose during 10,000 years


Example: Ground Shaking

What will be the ground response to an earthquake?

An engineering simulator uses a finite element scheme to simulate ground motion. Shaking of the bedrock generates surface motion.

We wish to study the output from the program to aid earthquake preparedness plans.


Example: Ground Shaking

Inputs
• Geometry of the ground surface
• Layers of hard/soft soil below the surface
• Shear velocity, density, elasticity of the soil in each layer
• Amplitude and spectrum at bedrock

Outputs
• Displacement along the surface
• Acceleration along the surface


Example: Chemotherapy Simulator

What is the effect of chemotherapy treatment?

The treatment affects both cancerous and healthy cells in the body.

The goal is to develop treatment protocols that will put the cancer into remission with minimal damage to healthy cells.


Example: Chemotherapy Simulator

Inputs
• Treatment protocol: dosage and timing
• Rate of drug decay
• Rate of cell death
• Rate of cell regeneration

Outputs
• Number of healthy and malignant cells, as a fraction of the initial count


Example: Piston Performance

The piston simulator was written by Kenett and Zacks as a teaching tool for their textbook.

The simulator describes the cycle time of a piston and is based on the physics governing the piston.

Variation in output is related to tolerances in the inputs.

The goal was to achieve a target cycle time with minimal variation.


Example: Piston Performance

Inputs
• A: Piston Weight (kg)
• B: Piston Surface Area (m²)
• C: Initial Gas Volume (m³)
• D: Spring Coefficient (N/m)
• E: Atmospheric Pressure (N/m²)
• F: Ambient Temperature (K)
• G: Gas Temperature (K)

Output
• Cycle time


Latin Hypercube Designs

Latin Hypercube designs (LHD’s) are the most popular class of experimental plan for computer experiments.

LHD’s place the input levels for each factor on a uniform grid.

Then “mate” the levels across factors by randomly permuting the column for each factor.

McKay, Beckman and Conover, Technometrics, 1979.
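The two steps above (a uniform grid per factor, then random mating) can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a grid on [0, 1]; the function name `latin_hypercube` is made up for this example.

```python
import numpy as np

def latin_hypercube(n_runs, n_factors, rng):
    """Return an n_runs x n_factors Latin hypercube on a uniform grid in [0, 1]."""
    grid = np.linspace(0.0, 1.0, n_runs)        # same uniform grid of levels for every factor
    design = np.empty((n_runs, n_factors))
    for j in range(n_factors):
        design[:, j] = rng.permutation(grid)    # "mate" levels by an independent random shuffle
    return design

rng = np.random.default_rng(0)
lhd = latin_hypercube(11, 3, rng)               # 11 runs, 3 factors
```

Every column of `lhd` contains each grid level exactly once; only the pairing of levels across factors is random.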


Latin Hypercube Designs

Example of a Latin Hypercube design for 3 factors.

Initial Grids          Shuffled Grids
1    1    1            1    0.3  0.5
0.9  0.9  0.9          0.9  0.4  0.2
0.8  0.8  0.8          0.8  1    0.7
0.7  0.7  0.7          0.7  0.6  0
0.6  0.6  0.6          0.6  0.2  1
0.5  0.5  0.5          0.5  0.7  0.9
0.4  0.4  0.4          0.4  0    0.1
0.3  0.3  0.3          0.3  0.9  0.6
0.2  0.2  0.2          0.2  0.5  0.4
0.1  0.1  0.1          0.1  0.8  0.8
0    0    0            0    0.1  0.3


Latin Hypercube Designs

Some 2-factor projections from a 250-run LHD.


Latin Hypercube Designs

Other mating schemes have been suggested to obtain columns with low correlation.

Ye showed how to get $2m-2$ fully orthogonal columns with $2^m$ runs.

Butler showed how to get orthogonality with respect to a trigonometric regression model and 2m runs.

How many orthogonal columns are possible?


Rotated Factorial Designs

Bursztyn and Steinberg developed experimental plans with many levels in which linear effects are orthogonal.

Start with a “standard” first-order orthogonal design D, like a $2^{k-p}$ fractional factorial.

“Rotate” the design using a rotation matrix R: $D \rightarrow DR$.

Then $(DR)'(DR) = R'D'DR = nR'R = nI$.
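The identity can be checked numerically. The sketch below assumes NumPy; the orthogonal matrix R is taken from a QR decomposition of a random matrix, an arbitrary choice made only for illustration and not the rotation used in the talk.

```python
import itertools
import numpy as np

# D: a 2^3 full factorial in +/-1 coding (8 runs, 3 factors), so D'D = 8 I.
D = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
n = D.shape[0]

# R: any orthogonal matrix works; this one is arbitrary (illustrative assumption).
rng = np.random.default_rng(1)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

DR = D @ R
gram = DR.T @ DR                                 # should equal n * I
n_levels = [len(np.unique(np.round(DR[:, j], 10))) for j in range(3)]
```

The columns of `DR` remain orthogonal, while each factor now takes more than two distinct levels.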


LHD’s as Rotated Factorial Designs

Steinberg and Lin showed how to rotate two-level factorials into Latin Hypercube designs with a large number of first-order orthogonal columns.

This work combines a rotation idea in Bursztyn and Steinberg with another rotation idea developed by Lin and Beattie.


LHD’s as Rotated Factorial Designs

Lin and Beattie: rotate $2^k$ factorials to Latin Hypercube designs.

The intuition:
• Columns in a LHD are an arithmetic sequence.
• Each entry in a column of DR is a linear combination of the entries in one row of D (the $2^k$ design).
• The rows of D are a binary expansion of the odd integers.
• Using appropriate powers of 2 as the elements in R, each column in DR is an integer sequence.


LHD’s as Rotated Factorial Designs

-1  -1  -1
 1  -1  -1
-1   1  -1
 1   1  -1
-1  -1   1
 1  -1   1
-1   1   1
 1   1   1


LHD’s as Rotated Factorial Designs

Weights:   2  -4   1

          -1  -1  -1
           1  -1  -1
          -1   1  -1
           1   1  -1
          -1  -1   1
           1  -1   1
          -1   1   1
           1   1   1


LHD’s as Rotated Factorial Designs

Weights:   2  -4   1     Weighted Sums

          -1  -1  -1          1
           1  -1  -1          5
          -1   1  -1         -7
           1   1  -1         -3
          -1  -1   1          3
           1  -1   1          7
          -1   1   1         -5
           1   1   1         -1
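The weighted sums in this example can be reproduced directly. A short sketch, assuming NumPy, with the rows of the $2^3$ design in the order shown (first factor changing fastest):

```python
import numpy as np

# The 2^3 factorial in the row order used in the example above.
D = np.array([[a, b, c] for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)])
weights = np.array([2, -4, 1])      # +/- powers of 2, one per factor

sums = D @ weights                  # one weighted sum per run
```

Because the weights are distinct powers of 2 up to sign, the eight sums are exactly the odd integers from -7 to 7, each appearing once, which is an equally spaced (Latin hypercube) column.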


LHD’s as Rotated Factorial Designs

Lin and Beattie: rotate $2^k$ factorials to Latin Hypercube designs.

• Can we organize weights for multiple columns in a rotation matrix R?
• Yes – provided R is t by t, where t is a power of 2.
• A simple recursive scheme gives the rotation matrices.
• Original proposal limited to full factorial designs $2^k$, where k is a power of 2.
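One recursive scheme of this kind can be sketched as follows. The block pattern and sign convention used here are an assumption chosen for illustration (other sign choices also work), and the function name `rotation_weights` is hypothetical; the code works with the unnormalized integer weights rather than the normalized rotation matrix.

```python
import itertools
import numpy as np

def rotation_weights(c):
    """Unnormalized 2^c x 2^c weight matrix, built recursively from [1]."""
    V = np.array([[1]])
    for step in range(1, c + 1):
        s = 2 ** (2 ** (step - 1))              # power of 2 scaling the off-diagonal blocks
        V = np.block([[V, -s * V], [s * V, V]])
    return V

c = 2
t = 2 ** c                                       # number of factors, a power of 2
V = rotation_weights(c)
D = np.array(list(itertools.product([-1, 1], repeat=t)))   # 2^t full factorial, 16 runs
X = D @ V                                        # rotated design, before rescaling
```

With t = 4, every column of `X` is a permutation of the odd integers from -15 to 15, and the columns are mutually orthogonal. Dividing `V` by its common column norm gives a proper rotation matrix.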


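LHD’s as Rotated Factorial Designs

Lin and Beattie: rotate $2^k$ factorials to Latin Hypercube designs.

[Equation: recursive definition of the rotation matrices. $R_0 = 1$; $R_1$ is a 2 × 2 matrix with entries of magnitude 1 and 2, normalized by $1/\sqrt{5}$; $R_j$ is assembled from 2 × 2 blocks of $R_{j-1}$, some of them scaled by a power of 2.]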


LHD’s as Rotated Factorial Designs

Bursztyn and Steinberg showed that fractional factorial designs can also be rotated.

First, the design must be decomposed into sets of factors, each of which is a full factorial.


LHD’s as Rotated Factorial Designs

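Steinberg and Lin: combine the grouping into full factorials (Bursztyn & Steinberg) with the rotation matrices R (Lin & Beattie).

$$
[D_1 \mid \cdots \mid D_t]
\begin{pmatrix} R & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & R \end{pmatrix}
= [D_1 R \mid \cdots \mid D_t R]
$$

The resulting design is an orthogonal Latin hypercube.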


LHD’s as Rotated Factorial Designs

The construction requires that each set of columns be a full factorial design.

Suppose we start with a saturated fractional factorial with $2^m$ runs.

How can we “group” the columns to achieve the maximum number of full factorials?


LHD’s as Rotated Factorial Designs

We can order the columns so that each set of m consecutive columns is a full factorial.

1. Identify the columns as the non-zero points in $GF(2^m)$.

2. All non-zero points (hence all columns) can be obtained as $x^j \bmod p(x)$, where p(x) is a primitive polynomial of $GF(2^m)$.

3. Order the columns by the order of the powers.

4. A set of m consecutive columns is not a full factorial if it has a linear dependency. It is easy to show that this implies a linear dependency in the first m columns.


LHD’s as Rotated Factorial Designs

1. Identify the columns as the non-zero points in $GF(2^m)$, the Galois Field of binary vectors of length m.

The column of 1’s is matched with (0,0,…,0).

The column for A is matched with (1,0,…,0).

The column for B is matched with (0,1,0,…,0).

The column for AB is matched with (1,1,0,…,0).

In general, the column for any interaction is matched with a vector with 1’s marking the factors involved in the interaction.


LHD’s as Rotated Factorial Designs

1. Identify the columns as the non-zero points in $GF(2^m)$, the Galois Field of binary vectors of length m.

Each binary vector is used to represent a polynomial with binary coefficients.

AC     (1,0,1,0,0,0)     $1 + x^2$

BDF    (0,1,0,1,0,1)     $x + x^3 + x^5$


LHD’s as Rotated Factorial Designs

2. All non-zero points (hence all columns) can be obtained as $x^j \bmod p(x)$, where p(x) is a primitive polynomial of $GF(2^m)$.

GF theory – there exists a primitive polynomial, p(x), that can be used to generate all the non-zero polynomials in $GF(2^m)$.

The primitive polynomial is a binary polynomial of degree m. Recall that m is the number of factors, so we want to generate all polynomials of degree m-1 or less.

All calculations are carried out modulo 2.


LHD’s as Rotated Factorial Designs

2. All non-zero points (hence all columns) can be obtained as $x^j \bmod p(x)$, where p(x) is a primitive polynomial of $GF(2^m)$.

For example, with m=4, a primitive polynomial is $1+x+x^4$.

$x^0 \equiv 1$ (A)    $x^1 \equiv x$ (B)    $x^2 \equiv x^2$ (C)    $x^3 \equiv x^3$ (D)

$x^4 \equiv 1+x$ (AB)    $x^5 \equiv x+x^2$ (BC)    etc.

If we continue, we find all the non-zero polynomials.

Every set of m successive columns is a full factorial.
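The m=4 example can be checked with a short script. This is a sketch in plain Python; the helper names `column_vectors` and `rank_gf2` are made up for this illustration. A window of m columns is a full factorial exactly when its binary vectors are linearly independent over GF(2).

```python
def column_vectors(m, poly_taps):
    """Binary coefficient vectors of x^j mod p(x), for j = 0, ..., 2^m - 2.

    poly_taps lists the exponents below m with nonzero coefficient in p(x);
    for p(x) = 1 + x + x^4 this is (0, 1).
    """
    vecs = []
    v = [1] + [0] * (m - 1)                      # x^0 = 1
    for _ in range(2 ** m - 1):
        vecs.append(tuple(v))
        carry = v[-1]                            # coefficient of x^(m-1) before multiplying by x
        v = [0] + v[:-1]                         # multiply by x
        if carry:                                # reduce using x^m = sum of x^e, e in poly_taps
            for e in poly_taps:
                v[e] ^= 1
    return vecs

def rank_gf2(rows):
    """Rank over GF(2) of a list of binary vectors, by Gaussian elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

m = 4
vecs = column_vectors(m, poly_taps=(0, 1))       # p(x) = 1 + x + x^4
windows_ok = all(rank_gf2(vecs[j:j + m]) == m for j in range(len(vecs) - m + 1))
```

Here `vecs[4]` is (1,1,0,0), the column AB, and `vecs[5]` is (0,1,1,0), the column BC, matching the list above.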


LHD’s as Rotated Factorial Designs

The rotated designs are a special class of Latin Hypercubes with an external orthogonal array structure (U-designs).

For each pair of columns, ¼ of all the points are in each quadrant.

For many pairs, finer divisions hold.


LHD’s as Rotated Factorial Designs

Some 2-factor projections from the design of the ground-shaking study.


LHD’s as Rotated Factorial Designs

Points may “clump” in low-dimensional projections.

In high dimensions, points do not clump.

The rotation is isometric, so the inter-point differences are like those in the original factorial, except for “shrinking” the final design back to a hypercube.


LHD’s as Rotated Factorial Designs

Steinberg and Lin show that these rotated designs have good statistical properties as screening designs.

Main effects have low aliasing with second order effects (by comparison with randomly mated LHC designs or randomly chosen U-designs).


LHD’s as Rotated Factorial Designs

Suppose you use the design to fit a simple first-order regression model, to “screen” the most influential factors:

Y = Xβ + ε.

But the true dependence involves additional regression terms:

Y = Xβ + Zγ.

Then $\hat{\beta} = \beta + (X'X)^{-1}X'Z\gamma = \beta + A\gamma$.

The matrix A is known as the alias matrix.
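As a rough illustration, the alias matrix can be computed for any design with a least-squares solve. The sketch below assumes NumPy and uses a small randomly mated LHD as a stand-in design; the sizes and the 0.1 cutoff are illustrative choices.

```python
import itertools
import numpy as np

def alias_matrix(X, Z):
    """A = (X'X)^{-1} X'Z, computed with a least-squares solve."""
    return np.linalg.lstsq(X, Z, rcond=None)[0]

rng = np.random.default_rng(2)
n, k = 16, 4
# A randomly mated LHD on [-1, 1], used only as an example design.
grid = np.linspace(-1.0, 1.0, n)
design = np.column_stack([rng.permutation(grid) for _ in range(k)])

X = np.column_stack([np.ones(n), design])        # first-order screening model
pairs = list(itertools.combinations(range(k), 2))
Z = np.column_stack([design[:, i] * design[:, j] for i, j in pairs])  # two-factor interactions

A = alias_matrix(X, Z)
share_small = np.mean(np.abs(A) < 0.1)           # fraction of alias entries below 0.1 in absolute value
```

Each column of `A` shows how one omitted term in Z biases the estimated first-order coefficients.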


LHD’s as Rotated Factorial Designs

The alias matrix depends on the design, the model used for screening, and the extra terms in Z.

A good screening design should have small values in A for simple screening models and somewhat more complex extra terms.

Bursztyn and Steinberg, JSPI (2006), 1103-1119.


LHD’s as Rotated Factorial Designs

We compared 16-run, 12-factor designs, with a first-order screening model and extra terms of second order.

The alternatives: a standard LHD (best of 100 random choices) and an OA-based LHD (best of 100 random choices).


LHD’s as Rotated Factorial Designs

The percent of entries in A that were < 0.1:

                   Two-factor interactions   Pure Quadratics
Orthogonal LHD             65.0%                  74.3%
Standard LHD               30.7%                  50.7%
OA-based LHD               52.7%                  45.8%


LHD’s as Rotated Factorial Designs

For the standard and OA-based LHD’s, the results shown are the best found for 100 random designs.

For the orthogonal LHD, all non-isomorphic groupings of columns into 3 sets of 4 columns were found. Results were very similar for all groupings.


LHD’s as Rotated Factorial Designs

A design with n/2 columns, all orthogonal to each other and to all possible second-order effects, can be constructed using the same ideas.

The trick is in the choice of the starting design.

We rotate the resolution IV “foldover” design. The rotation preserves the foldover property and that, in turn, guarantees the orthogonality properties.

The $GF(2^m)$ structure again provides a way to group the columns into full factorials.


Near LHD’s from Rotated Factorials

Orthogonal designs that are nearly LHD’s can be obtained by rotating other base designs.

Example: use as the base design the 48 run Plackett-Burman design.

Rotate 40 factors in 5 groups of 8.

The rotated design has all columns orthogonal. It is also a U-design.

It is nearly a Latin Hypercube.


Near LHD’s from Rotated Factorials

Below is a q-q plot for one of the factors against a uniform distribution.


Nuclear Waste Repository: Quandaries

• Main goal is to assess which input factors have greatest influence on output: Sensitivity Analysis.
• For example: given a proposed site, which factors should be measured?
• Output data are highly skewed, with many 0’s (configurations with no leaching into the drinking water).
• What is the best way to summarize the results?


RESRAD

RESRAD is a computer model designed to estimate radiation doses and risks from RESidual RADioactive materials.

RESRAD simulates radiation doses and cancer risks for a variety of pathways in the environment (e.g. drinking water, food chain, atmosphere).

Developed at Argonne National Laboratory. http://web.ead.anl.gov/resrad/


RESRAD

Number of input parameters can reach hundreds.

Most parameters are difficult/expensive to measure or control and are subject to wide ranges of uncertainty.


RESRAD

Typical RESRAD output.

[Figure: DOSE: U-238, All Pathways Summed, plotted against Years on a log scale from 10 to 10,000. Prob_run.RAD 07/04/2004 11:53, Includes All Pathways.]


Our Case Study

Twenty-seven input parameters.

Initial radionuclide is U238 buried at a depth of 2 meters.

Lithology is one-dimensional, with contaminated, unsaturated and saturated layers above groundwater.


Our Case Study

• Wide uncertainties for inputs.
• Many have log-normal distributions as a reflection of scientific uncertainty.
• The distribution coefficients for U234 and U238 should be identical.

Outcome: maximal annual dose during 10k years.


Our Case Study

Use RESRAD’s built-in capability for sensitivity analysis.

Options include:
• One-factor-at-a-time analysis.
• Random samples of input settings.
• Latin Hypercube samples.
• Different input parameter distributions (e.g. uniform, normal, log-normal).
• Specified rank correlations of inputs.


Our Case Study

Limitations include:
• Inability to enforce equality of inputs.
• Limited ability to trace dose across time.
• Built-in analyses.


Our Case Study

• We generated 900 training points, using 3 LHS’s of 300 runs each.
• Most inputs were sampled from lognormal distributions.
• The Kd’s for U238 and U234 were given a rank correlation of 0.99.
• The Kd’s of the same isotope in different layers were given rank correlations of 0.3.
• A separate test set of 300 test points was generated from 3 100-run LHS’s.
• A second test set was generated with some of the original inputs at fixed values.
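The sampling plan can be sketched outside RESRAD as follows. This is an assumption-laden illustration, not RESRAD's own sampler: it uses NumPy and the standard library `statistics.NormalDist`, the log-scale means and SDs are made-up values, and the inputs are sampled independently (imposing the rank correlations would need an additional step not shown here).

```python
import numpy as np
from statistics import NormalDist

def lognormal_lhs(n_runs, mu, sigma, rng):
    """Latin hypercube sample with independent lognormal marginals.

    mu and sigma are per-input mean and SD on the log scale (illustrative values).
    """
    k = len(mu)
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    sample = np.empty((n_runs, k))
    for j in range(k):
        # one uniform draw inside each of n_runs equal-probability strata, in random order
        u = (rng.permutation(n_runs) + rng.uniform(size=n_runs)) / n_runs
        sample[:, j] = np.exp(mu[j] + sigma[j] * inv_cdf(u))
    return sample

rng = np.random.default_rng(3)
# Three LHS's of 300 runs each, stacked into 900 training points (two inputs here).
train = np.vstack([lognormal_lhs(300, mu=[0.0, 1.0], sigma=[1.0, 0.5], rng=rng)
                   for _ in range(3)])
```

Within each 300-run block, every input has exactly one value in each of 300 equal-probability strata of its lognormal distribution.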


Our Case Study

A typical plot of the output vs. a strong input.

[Figure: Max Dose (0 to 15,000) vs. U238 Kd, Unsaturated Layer (log scale, 10 to 500).]


Our Case Study

A typical plot of the output vs. a strong input.

[Figure: Max Dose (0 to 15,000) vs. Soil Depth (0 to 200).]


Our Case Study

On the 900 training points, 76% had no migration at all into the water supply.

The migration on the remaining 24% was highly skewed.

[Figure: two normal quantile plots of Max Dose vs. Quantiles of Standard Normal, one on the original scale (0 to 15,000) and one on a log scale.]


Our Case Study

Below is a normal plot of the log maximal dose for samples with a maximal dose of at least 0.01.

[Figure: Max Dose (log scale, 10^-2 to 10^4) vs. Quantiles of Standard Normal.]


Our Case Study

RESRAD provides automatic sensitivity output, which includes:
• Partial correlation and regression coefficient of outcome with each input.
• Rank correlation and rank regression coefficient of outcome with each input.

Often these measures point to quite different inputs as being most influential.


Coefficient                             PCC          SRC          PRCC         SRRC
Repetition                              1            1            1            1
Description of Probabilistic Variable   Sig  Coeff   Sig  Coeff   Sig  Coeff   Sig  Coeff
------------------------------------------------------------------------------------------
Concentration of U-238                    7   0.07    11   0.07    17  -0.04    21  -0.03
Kd of U-238 in Contaminated Zone         10  -0.06     2  -0.18    23   0.01    16   0.04
Kd of U-238 in Unsaturated Zone 1        13  -0.06     6  -0.14    19  -0.02    12  -0.07
Kd of U-238 in Saturated Zone            20   0.03     9   0.08    10  -0.07     3  -0.21
Kd of U-234 in Contaminated Zone         12   0.06     4   0.18    20  -0.02    14  -0.06
Kd of U-234 in Unsaturated Zone 1        26  -0.01    23  -0.01    11  -0.06     4  -0.20
Kd of U-234 in Saturated Zone             6  -0.08     1  -0.22    18   0.02    13   0.07
Kd of Th-230 in Contaminated Zone        23  -0.01    24  -0.01    16   0.04    20   0.03
Kd of Th-230 in Unsaturated Zone 1        5  -0.08    10  -0.08    14   0.04    18   0.03
Kd of Th-230 in Saturated Zone            3   0.12     7   0.12    27   0.00    27   0.00
Kd of Ra-226 in Contaminated Zone         9  -0.07    12  -0.07     9  -0.09    11  -0.07
Kd of Ra-226 in Unsaturated Zone 1       22  -0.02    22  -0.02     4  -0.19     6  -0.15
Kd of Ra-226 in Saturated Zone           24   0.01    25   0.01    13  -0.05    17  -0.04
Kd of Pb-210 in Contaminated Zone        15   0.05    16   0.05    15  -0.04    19  -0.03
Kd of Pb-210 in Unsaturated Zone 1       14   0.05    15   0.05    26   0.01    26   0.01
Kd of Pb-210 in Saturated Zone           19   0.03    19   0.03    25  -0.01    25  -0.01
Precipitation                             8   0.07    13   0.06     2   0.29     2   0.22
Saturated zone hydraulic conductivity     2   0.17     5   0.16     8   0.09    10   0.07
Saturated zone hydraulic gradient        17   0.04    18   0.04    21  -0.02    22  -0.01
Well pump intake depth                   21   0.03    21   0.02    12   0.06    15   0.04
Well pumping rate                        18   0.03    20   0.03    22   0.02    23   0.01
Thickness of Unsaturated zone 1           1  -0.19     3  -0.18     1  -0.52     1  -0.44
Hydraulic Conduct of Unsat zone 1        25   0.01    26   0.01    24  -0.01    24  -0.01
Total Porosity                           27   0.00    27   0.00     5   0.18     7   0.14
Saturated zone total porosity             4   0.11     8   0.10     6   0.18     8   0.13
Effective Porosity                       11  -0.06    14  -0.06     3  -0.26     5  -0.19
Saturated zone effective porosity        16  -0.04    17  -0.04     7  -0.16     9  -0.11
------------------------------------------------------------------------------------------
R-SQUARE                                      0.16         0.16         0.48         0.48


Our Case Study

The partial correlations and regressions are dominated by a small number of very large doses.

The rank analyses largely ignore the large dose information.

The partial correlations have trouble with highly correlated inputs.

The partial regressions may overstate the importance of highly correlated inputs.

All the measures consider only linear dependence.


Our Case Study

We applied a two-phase analysis:

1. Find which inputs are associated with having a maximal dose of at least 0.1.

2. Among doses of at least 0.1, find which inputs are associated with high doses.
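The split into two responses can be sketched as below. This assumes NumPy; the data are made up for illustration, the helper name `two_phase_split` is hypothetical, and the base-10 log is an assumption (the slides only say the dose is analyzed on a log scale).

```python
import numpy as np

def two_phase_split(inputs, max_dose, threshold=0.1):
    """Split simulator output into a binary presence response and a log-dose response."""
    present = max_dose >= threshold                 # phase 1 response: contamination yes/no
    log_dose = np.log10(max_dose[present])          # phase 2 response, only where present
    return present.astype(int), inputs[present], log_dose

# Tiny made-up data set, for illustration only.
inputs = np.array([[0.2], [1.5], [3.0], [0.7], [2.2]])
max_dose = np.array([0.0, 12.0, 0.05, 300.0, 0.1])
present, inputs_pos, log_dose = two_phase_split(inputs, max_dose)
```

The binary vector `present` would feed a classifier for phase 1, and `inputs_pos` with `log_dose` would feed a regression for phase 2.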


Presence/Absence

The first analysis treats the outcome as binary.

Contamination of at least 0.1 was found in 18% of the training cases.

Logistic regression and CART were used to fit predictive models for having a maximal dose of at least 0.1.


Presence/Absence

U234 Kd’s were not used due to the high correlation with U238 Kd’s.

Ten input factors were included in the final logistic model, along with some quadratic terms and some interactions.


Presence/Absence

Below is a plot of presence of contamination in the test data vs. phat.

[Figure: Contamination (0 to 1) vs. Phat (0 to 1).]


Presence/Absence

The CART model was not as successful.

It often exploited “unimportant” variables for final splits.

It often ignored important variables.

Overall the dependence of “presence of contamination” on the inputs appears to be too smooth to be picked up well by CART.

Other methods could certainly be tried.


Presence/Absence

The following table summarizes the success on the test data.

phat        Contamination (Logistic)   Contamination (CART)
<0.0001            0/119                    10/126
<0.001             1/154                    10/126
<0.01              4/181                    13/175
<0.1              10/212                    19/211
<0.2              12/224                    19/219
>0.5              37/49                     31/51


Max Dose

To predict the maximal dose, when contamination is present, several different regression models were run after transforming to a log scale.

We also fitted a GASP model using the PeRK Software from Brian Williams at Los Alamos National Labs.


Max Dose

The final regression model included 14 input factors, with mostly linear effects.

[Figure: Max Dose (log scale, 10^-2 to 10^4) vs. Regression Prediction.]

The mean error on the test data (with contamination) was 0.08 with a SD of 0.74.

Training data SD was 0.70.


Max Dose

Linear regression with all inputs has a test case SD of 0.82.

[Figure: Max Dose (log scale, 10^-2 to 10^4) vs. GASP Prediction.]

GASP model using 10 main inputs.

Mean error 0.10.

SD 0.86.


Max Dose

Projection pursuit regression model.

Mean error 0.03.

SD 0.83.

Similar to linear regression.

[Figure: Max Dose (log scale, 10^-2 to 10^4) vs. Projection Pursuit Prediction.]


Nuclear Waste Repository: Quandaries

• What is the best way to summarize the results?

• Some factors were influential in determining the maximal dose, if present, but were not important for presence/absence.

• An important question is “how deep” an input configuration sits in the “no migration” region.


Chemotherapy: Quandaries
Ground Shaking: Quandaries

• Both of these studies have multivariate output.

• In the chemotherapy study, we generate curves (vs. time) of the cell concentrations.

• In the ground shaking study, we get output at a grid of spatial values (on the ground surface). At each site, we have motion, velocity and acceleration as a function of frequency.


Chemotherapy: Quandaries
Ground Shaking: Quandaries

• What are effective ways to summarize this high dimensional output?


Ground Shaking: Quandaries

• Approach has been to compute simple low dimensional summaries.

• Focus on acceleration (of highest engineering importance).

• Summarize across frequencies by computing root mean square acceleration.

• Model RMS acceleration as a function of the spatial locations.
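The frequency summary can be sketched as follows, assuming NumPy and assuming the simulator output for one run is stored as an array of accelerations with one row per surface site and one column per frequency (the numbers are hypothetical).

```python
import numpy as np

def rms_over_frequency(accel):
    """Root mean square over the last axis (frequency) of an acceleration array."""
    return np.sqrt(np.mean(np.square(accel), axis=-1))

# Hypothetical output for one run: 3 surface sites x 4 frequencies.
accel = np.array([[3.0, 4.0, 3.0, 4.0],
                  [0.0, 0.0, 0.0, 2.0],
                  [1.0, 1.0, 1.0, 1.0]])
rms = rms_over_frequency(accel)      # one RMS value per site
```

The resulting vector `rms` is the low dimensional response that is then modeled as a function of spatial location.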


Chemotherapy: Quandaries

• Data are much more “dense” in time than in the input factor space.

• We have looked at several methods for explicitly modeling the time dependence, then modeling those functions in terms of the input factors.


Chemotherapy is given for 11 hrs: A (virtual!) patient is exposed to 3.95 mg of a steroid that decomposes at a rate of 1.487 mg/(cm³·hr). Cancer cells grow at a rate of 0.0697.

We track the number of cancer cells in the patient’s body throughout the treatment duration, recording results every 6 minutes.

The result is a time dependent curve.


Chemotherapy Data

[Figure: a single output curve, with values from about 0.985 to 0.995, over a horizontal axis from 0 to 100.]


Chemotherapy data for several protocols

[Figure: 16 panels of output curves, one per protocol, each over a horizontal axis from 0 to 400.]


Chemotherapy data for several protocols

What is a good approach to model the data?

Some options:

• Fit a B-spline to each curve, then model the parameters as a function of the inputs. Might add constraints using models for specific time points.

• Derive basis functions from the observed curves via functional cluster analysis. Use this on the “raw” data or on “scaled” data?


[Figure: 20 panels of curves, one per test setting, each over a horizontal axis from 0 to 400.]

Results from the B-spline models, on 20 independent test settings.


Data driven basis functions

1. Select k - number of basis functions

2. Define distance function

3. Cluster data into k disjoint groups

4. Use cluster means as basis functions
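The four steps can be sketched with a plain k-means on the curves. This is a minimal illustration under stated assumptions: NumPy only, Euclidean distance between curves as the distance function, synthetic curves standing in for simulator output, and a hypothetical helper name `cluster_mean_basis`.

```python
import numpy as np

def cluster_mean_basis(curves, k, n_iter=50, seed=0):
    """k-means on curves (rows) with Euclidean distance; returns (means, labels)."""
    rng = np.random.default_rng(seed)
    means = curves[rng.choice(len(curves), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared Euclidean distance from every curve to every current mean
        d2 = ((curves[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for g in range(k):
            if np.any(labels == g):              # keep the old mean if a cluster empties
                means[g] = curves[labels == g].mean(axis=0)
    return means, labels

# Synthetic curves: two decay rates plus small noise (illustration only).
t = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(4)
curves = np.vstack([np.exp(-r * t) + 0.01 * rng.normal(size=t.size)
                    for r in [1.0] * 10 + [5.0] * 10])
basis, labels = cluster_mean_basis(curves, k=2)
# Project each curve on the basis functions to get k coefficients per run.
coefs = np.linalg.lstsq(basis.T, curves.T, rcond=None)[0].T
```

The rows of `coefs` are the low dimensional responses that would then be modeled as functions of the input factors.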


Shape vs. Scale

Consider these functions:


After transformation

[Figure: transformed curves, with values from 1.0 to 2.0, over a horizontal axis from 0 to 400.]


4 clusters for chemotherapy data

[Figure: cluster curves, with values from 1.0 to 2.0, over a horizontal axis from 0 to 400.]


First stage results

[Figure: multiple panels of curves, with values roughly from 1.0 to 2.0, each over a horizontal axis from 0 to 400.]