
Group Confidence Limit Plots Using PROC GPLOT and PROC SGPLOT

Robert A. Rutledge, Sun Microsystems, Menlo Park, CA

ABSTRACT

It is often necessary to create a plot of means for each of several sub-populations, for example a trend plot of the mean defect rate for units produced in each month of the year, or the mean defect rate during the same month for each of several vendors. It is important to be able to distinguish significant differences from random variation in these plots, and one very effective way to do this is to add confidence limits to each of the mean values. The calculation of the confidence limits depends on the assumed distribution of the response variables, for example

• The Normal Distribution – which is appropriate for most continuous variables.
• The Binomial Distribution – which is appropriate for pass/fail variables.
• The Poisson Distribution – which is appropriate for counts-per-unit variables, such as the numbers of defects on a silicon wafer.

This paper shows how to create confidence limit plots for each of these distributions, first using PROC GPLOT, and then, with much less effort, using PROC SGPLOT.

1. INTRODUCTION

Figure 1 shows a trend plot of average resistance by month of test, with 90% confidence limits on the mean values. The average resistance increased in November and then decreased again in December. You can see that the increase from October to November is statistically significant because the lower confidence limit for November is much higher than the upper confidence limit for October. However, the decrease in December is not significant because the confidence limits on the December average include the averages from October and November. The confidence limits for December are wide because the sample size is much smaller: only 5 units were tested in December vs. 400 units each in October and November. This kind of data is very common because the most recent time period often does not yet have complete test results.

Figure 1: Trend Plot with Confidence Limits

Section 2 shows how to compute the required confidence limits, and Sections 3 and 4 show how to create the plot shown in Figure 1 using PROC GPLOT or PROC SGPLOT.


2. COMPUTING CONFIDENCE LIMITS

The examples in this paper use a SAS data set, Results, which contains test results, by month of test, for three different test measurements. The first six rows of Results are shown below. The full data set has 805 rows – 400 each for 2009-10 and 2009-11, and 5 rows for 2009-12.

Results (First Six Rows)

    Month     Resistance   Fail   Defects
    2009-10        19.41      0         1
    2009-10        19.30      0         1
    2009-10        19.01      0         3
    2009-10        19.06      0         3
    2009-10        19.65      1         1
    2009-10        20.50      0         0

The Resistance variable is assumed to follow the normal distribution. The Fail variable takes only the values 0 and 1, which represent ‘Pass’ and ‘Fail’ respectively, and is assumed to follow the Bernoulli distribution. Therefore the number of failures (the sum of the Fail variable) for each month follows the binomial distribution. The Defects variable is assumed to follow the Poisson distribution.
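The Results data set itself is not distributed with this paper. To try the code in the following sections, a stand-in with the same layout can be simulated. The sketch below is only an assumption-laden substitute: the data set name Results_Sim, the seed, and the distribution parameters are arbitrary choices made here, so the summary tables it produces will not match the tables shown in this paper.

```sas
/* Hypothetical stand-in for Results: simulated values, NOT the paper's data. */
DATA Results_Sim;
   CALL STREAMINIT(2009);                     /* arbitrary seed */
   LENGTH Month $7;
   DO Month = '2009-10', '2009-11', '2009-12';
      /* 400 units in the complete months, 5 in the incomplete month */
      IF Month = '2009-12' THEN NUnits = 5;
      ELSE NUnits = 400;
      DO Unit = 1 TO NUnits;
         Resistance = ROUND(RAND('NORMAL', 19.5, 1.5), 0.01); /* arbitrary mean, sd */
         Fail       = RAND('BERNOULLI', 0.10);                /* arbitrary fail rate */
         Defects    = RAND('POISSON', 1.9);                   /* arbitrary defect rate */
         OUTPUT;
      END;
   END;
   DROP NUnits Unit;
RUN;
```

To run the later examples against the simulated data, either substitute DATA=Results_Sim or rename the data set to Results.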

NORMAL DISTRIBUTION

You can use the LCLM and UCLM keywords of PROC MEANS to compute confidence limits for the mean Resistance based on the normal distribution.

    PROC MEANS DATA=Results NOPRINT NWAY ALPHA=.10;
       CLASS Month;
       VAR Resistance;
       OUTPUT OUT=Tab MEAN(Resistance) = Mean
                      LCLM(Resistance) = LCL
                      UCLM(Resistance) = UCL;
    RUN;

The ALPHA=.10 option specifies two-sided 90% confidence limits. The LCLM and UCLM keywords compute the lower and upper limits of this two-sided 90% interval on mean Resistance (each limit taken alone is a one-sided 95% bound) and store the results in the LCL and UCL variables respectively. The output data set, Tab, is shown below.

Tab

    Month      Mean     LCL     UCL
    2009-10   19.36   19.20   19.52
    2009-11   20.36   20.20   20.53
    2009-12   19.64   18.18   21.09
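As a cross-check on what LCLM and UCLM return, the same limits can be rebuilt by hand from the count, mean and standard error of each month, using the t quantile with N-1 degrees of freedom. The following is a minimal sketch. The data set name Tab_Check and the variables SE and T are illustrative names introduced here, not part of the original example.

```sas
/* Summarize each month: count, mean and standard error of the mean. */
PROC MEANS DATA=Results NOPRINT NWAY;
   CLASS Month;
   VAR Resistance;
   OUTPUT OUT=Tab_Check N=N MEAN=Mean STDERR=SE;
RUN;

/* Rebuild the two-sided 90% limits: mean -/+ t(0.95, N-1) * SE. */
DATA Tab_Check;
   SET Tab_Check;
   T   = TINV(0.95, N-1);   /* 0.95 = 1 - ALPHA/2 with ALPHA = 0.10 */
   LCL = Mean - T*SE;
   UCL = Mean + T*SE;
RUN;
```

The LCL and UCL values in Tab_Check should agree with those produced by the LCLM and UCLM keywords with ALPHA=.10.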

POISSON DISTRIBUTION

PROC MEANS cannot be used to compute Poisson or binomial confidence limits directly, but these limits are easily computed from a PROC MEANS summary using the CINV or BETAINV functions. For the Defects variable, first use PROC MEANS to compute the count, sum and mean for each month.

    PROC MEANS DATA=Results NOPRINT NWAY;
       CLASS Month;
       VAR Defects;
       OUTPUT OUT=Tab N(Defects)    = N
                      SUM(Defects)  = Sum
                      MEAN(Defects) = Mean;
       FORMAT Mean 8.2;
    RUN;

It is assumed that the number of defects on each component has the Poisson distribution with mean D, where D is the unknown defect rate for the corresponding month. Therefore, the total number of defects in each month, Sum, has the Poisson distribution with mean N*D. The 100(1-α)% confidence limits for D can be computed as

    [CINV(α/2, 2*Sum)/(2*N), CINV(1-α/2, 2*(Sum+1))/(2*N)]

where CINV(p, df) is the SAS function which returns the p-th quantile of a chi-squared distribution with df degrees of freedom. This data step computes the confidence limits.

    DATA Tab;
       SET Tab;
       FORMAT LCL UCL 8.2;
       IF Sum=0 THEN LCL=0;
       IF Sum>0 THEN LCL=CINV(.05, 2*Sum)/(2*N);
       UCL=CINV(.95, 2*(Sum+1))/(2*N);
    RUN;

In the case where Sum=0, the CINV formula for the lower limit cannot be evaluated because it would require zero degrees of freedom, so LCL is set to zero. This makes it possible to plot a line from zero to UCL, the 95% upper confidence limit.

Tab

    Month       N   Sum   Mean    LCL    UCL
    2009-10   400   720   1.80   1.69   1.91
    2009-11   400   791   1.98   1.86   2.10
    2009-12     5    14   2.80   1.69   4.38
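The December row can be checked in isolation with a short DATA _NULL_ step. This is only a sketch. Alpha is an illustrative variable introduced here so that the confidence level appears in one place, and the values of N and Sum are copied from the 2009-12 row of the table above.

```sas
DATA _NULL_;
   Alpha = 0.10;            /* two-sided 90% limits (illustrative variable) */
   N     = 5;               /* units tested in 2009-12 */
   Sum   = 14;              /* total defects in 2009-12 */
   /* CINV needs positive degrees of freedom, so guard the Sum=0 case. */
   IF Sum > 0 THEN LCL = CINV(Alpha/2, 2*Sum)/(2*N);
   ELSE LCL = 0;
   UCL = CINV(1 - Alpha/2, 2*(Sum+1))/(2*N);
   PUT LCL= 8.2 UCL= 8.2;   /* write both limits to the SAS log */
RUN;
```

The two values written to the log should agree with the LCL and UCL listed for 2009-12. Changing Alpha to 0.05 would give two-sided 95% limits instead.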

BINOMIAL DISTRIBUTION

For the Fail variable, the first step is again to compute the count, sum and mean for each month.

    PROC MEANS DATA=Results NOPRINT NWAY;
       CLASS Month;
       VAR Fail;
       OUTPUT OUT=Tab N(Fail)    = N
                      SUM(Fail)  = Sum
                      MEAN(Fail) = Mean;
       FORMAT Mean Percent8.1;
    RUN;

If the probability of failure is equal to P, then the total number of failures in each month, Sum, has the binomial distribution with parameters (N,P). The 100(1-α)% confidence limits for P can be computed as

    [BETAINV(α/2, Sum, N+1-Sum), BETAINV(1-α/2, Sum+1, N-Sum)]

where BETAINV(p, a, b) is the SAS function which returns the p-th quantile of a beta distribution with parameters a and b. This data step computes the confidence limits.

    DATA Tab;
       SET Tab;
       FORMAT LCL UCL Percent8.1;
       IF Sum=0 THEN LCL=0;
       IF Sum>0 THEN LCL=BETAINV(.05, Sum, N+1-Sum);
       IF Sum=N THEN UCL=1;
       IF Sum<N THEN UCL=BETAINV(.95, Sum+1, N-Sum);
    RUN;

If Sum=0, LCL is set to 0, and if Sum=N, UCL is set to 1, to enable plotting of the line from LCL to UCL.

Tab

    Month       N   Sum    Mean     LCL     UCL
    2009-10   400    37    9.3%    7.0%   12.0%
    2009-11   400    42   10.5%    8.1%   13.4%
    2009-12     5     3   60.0%   18.9%   92.4%

The methods explained here can be used to create a data set, Tab, containing the required plot points, Mean, LCL and UCL, for a variable which has the normal, Poisson or binomial distribution. The following sections show how to create plots similar to Figure 1 from the Tab data set.
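The two boundary cases are easiest to see on a few hand-entered rows. The sketch below applies the same BETAINV logic to three hypothetical (N, Sum) pairs: no failures, the 2009-12 counts from the table above, and all failures. The data set name Edge and the variable Alpha are illustrative names introduced here.

```sas
DATA Edge;
   INPUT N Sum;
   Alpha = 0.10;                      /* two-sided 90% limits */
   /* BETAINV needs positive shape parameters, so guard both boundaries. */
   IF Sum = 0 THEN LCL = 0;
   ELSE LCL = BETAINV(Alpha/2, Sum, N+1-Sum);
   IF Sum = N THEN UCL = 1;
   ELSE UCL = BETAINV(1 - Alpha/2, Sum+1, N-Sum);
   DATALINES;
5 0
5 3
5 5
;
RUN;

PROC PRINT DATA=Edge NOOBS;
   FORMAT LCL UCL PERCENT8.1;
RUN;
```

The middle row should reproduce the limits listed for 2009-12. In the first and last rows one limit sits exactly on the boundary (0 or 1), and the other is still computed from the beta quantile.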


3. PLOTTING CONFIDENCE LIMITS USING PROC GPLOT

After you have created the Tab data set with the Mean, LCL and UCL values, it is easy to create a group confidence limit plot using PROC GPLOT.

    TITLE "Average Resistance by Month";
    SYMBOL1 VALUE=dot HEIGHT=1 COLOR=green INTERPOL=JOIN;
    SYMBOL2 VALUE=diamond HEIGHT=1 COLOR=blue;
    SYMBOL3 VALUE=diamond HEIGHT=1 COLOR=blue;
    AXIS1 LABEL=(H=1 F='Arial' "Month") VALUE=(F='Arial' H=1) OFFSET=(20,20);
    AXIS2 LABEL=(H=1 F='Arial' A=90 "Mean Resistance") VALUE=(F='Arial' H=1)
          ORDER=(18 TO 22 BY 1);
    LEGEND1 LABEL=NONE VALUE=(H=1 "Mean" "LCL" "UCL");
    PROC GPLOT DATA=Tab;
       PLOT (Mean LCL UCL)*Month / NAME="G_1" OVERLAY VAXIS=AXIS2 HAXIS=AXIS1
                                   LEGEND=LEGEND1;
    RUN;
    QUIT;

The resulting plot is shown in Figure 2.

Figure 2: Trend Plot with Confidence Limits Created with PROC GPLOT

ADD LINES JOINING THE UPPER AND LOWER CONFIDENCE LIMITS

Figure 2 would be easier to understand if there were lines connecting the upper and lower confidence limits as in Figure 1. However, there is no way for GPLOT to connect plot points for different variables in the same data set. The obvious solution is to create a new data set that has the LCL and UCL values in the same column.


    DATA Tab_2;
       SET Tab(RENAME=(LCL=CL)) Tab(RENAME=(UCL=CL));
    RUN;
    PROC SORT DATA=Tab_2;
       BY Month CL;
    RUN;

The Tab_2 data set has a new variable, CL, containing the values of LCL and UCL.

Tab_2

    Month      Mean      CL     UCL     LCL
    2009-10   19.36   19.20   19.52       .
    2009-10   19.36   19.52       .   19.20
    2009-11   20.36   20.20   20.53       .
    2009-11   20.36   20.53       .   20.20
    2009-12   19.64   18.18   21.09       .
    2009-12   19.64   21.09       .   18.18

Now it is easy to join the LCL and UCL variables by plotting CL vs. Month.

    SYMBOL2 INTERPOL=JOIN;
    LEGEND1 LABEL=NONE VALUE=(H=1 "Mean" "90% Confidence Limits");
    PROC GPLOT DATA=Tab_2;
       PLOT (Mean CL)*Month / NAME="G_2" OVERLAY VAXIS=AXIS2 HAXIS=AXIS1
                              LEGEND=LEGEND1;
    RUN;
    QUIT;

The resulting plot (Figure 3) joins the LCL and UCL values as required, but also includes unwanted lines joining the CL values for different months. The next section shows how to use the SKIPMISS option to remove these lines.

Figure 3: Trend Plot with Lines Joining LCL and UCL


USE THE SKIPMISS OPTION TO REMOVE THE UNWANTED LINES

When the INTERPOL=JOIN option is used in a SYMBOL statement, all the points to which the SYMBOL statement applies (the CL*Month points in this example) will be joined by a line. However, if the SKIPMISS option is used in the PLOT statement, then the line will be broken at any point where there is missing data for either of the plot variables. The code below creates the Tab_3 data set, which includes a missing value for CL for each value of Month, and then uses the SKIPMISS option to create the desired plot. Figure 4 shows the resulting plot with the unwanted lines (cf. Figure 3) removed.

    PROC SQL NOPRINT;
       CREATE TABLE Missing AS
       SELECT DISTINCT Month, Mean
       FROM Tab_2;
    QUIT;
    DATA Tab_3;
       SET Tab_2 Missing;
    RUN;
    PROC SORT DATA=Tab_3;
       BY Month CL;
    RUN;
    PROC GPLOT DATA=Tab_3;
       PLOT (Mean CL)*Month / NAME="G_3" OVERLAY VAXIS=AXIS2 HAXIS=AXIS1
                              LEGEND=LEGEND1 SKIPMISS;
    RUN;
    QUIT;

Tab_3

    Month      Mean      CL     UCL     LCL
    2009-10   19.36       .       .       .
    2009-10   19.36   19.20   19.52       .
    2009-10   19.36   19.52       .   19.20
    2009-11   20.36       .       .       .
    2009-11   20.36   20.20   20.53       .
    2009-11   20.36   20.53       .   20.20
    2009-12   19.64       .       .       .
    2009-12   19.64   18.18   21.09       .
    2009-12   19.64   21.09       .   18.18

Figure 4: Improved Trend Plot Using the SKIPMISS Option of PROC GPLOT
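The same three-rows-per-month layout can also be written in a single DATA step with explicit OUTPUT statements, which avoids the PROC SQL and PROC SORT steps. This is an alternative sketch, not the method used above. It assumes that Tab has one row per month and is already in Month order (as it is when produced by PROC MEANS with a CLASS statement), and that the SYMBOL, AXIS and LEGEND statements defined earlier are still in effect. The name Tab_3b is introduced here for illustration.

```sas
DATA Tab_3b;
   SET Tab;                 /* one row per Month, in Month order */
   CL = .;   OUTPUT;        /* missing row: breaks the line between months */
   CL = LCL; OUTPUT;        /* lower end of the limit line */
   CL = UCL; OUTPUT;        /* upper end of the limit line */
RUN;

PROC GPLOT DATA=Tab_3b;
   PLOT (Mean CL)*Month / OVERLAY VAXIS=AXIS2 HAXIS=AXIS1
                          LEGEND=LEGEND1 SKIPMISS;
RUN;
QUIT;
```

Unlike Tab_3, this data set carries non-missing LCL and UCL values on every row, so any later step that picks out one row per month by testing UCL would need a different filter, such as WHERE=(CL=UCL).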


ANNOTATE THE PLOT POINTS

Figure 4 is almost the same as Figure 1, except that it does not include the mean values printed next to the plot points. You can use the ANNOTATE option to add these values. The first step is to create an annotate data set containing instructions for drawing objects in the graphics output area. This code creates the My_Anno data set shown below, and then uses the ANNOTATE option to write the mean values on the plot.

    DATA My_Anno;
       SET Tab_3(WHERE=(UCL>0));
       FUNCTION='LABEL';
       XSYS='2'; YSYS='2'; HSYS='3';
       Y=Mean;
       XC=Month;
       POSITION='6';
       STYLE='"Arial"';
       SIZE=2;
       TEXT=" Mean = " || TRIM(LEFT(PUT(Mean, 6.2)));
       KEEP XSYS YSYS HSYS STYLE FUNCTION POSITION SIZE XC Y TEXT;
    RUN;
    PROC GPLOT DATA=Tab_3;
       PLOT (Mean CL)*Month / NAME="G_4" OVERLAY VAXIS=AXIS2 HAXIS=AXIS1
                              LEGEND=LEGEND1 SKIPMISS ANNOTATE=My_Anno;
    RUN;
    QUIT;

The My_Anno data set specifies that the value of the TEXT variable be drawn at the (XC,Y) point on the graph. The book and paper by Carpenter listed in the References section provide a good introduction to the use of the ANNOTATE option.

My_Anno

    FUNCTION   XSYS   YSYS   HSYS         Y   XC        POSITION   STYLE     SIZE   TEXT
    LABEL      2      2      3      19.3609   2009-10   6          "Arial"      2   Mean = 19.36
    LABEL      2      2      3      20.3640   2009-11   6          "Arial"      2   Mean = 20.36
    LABEL      2      2      3      19.6360   2009-12   6          "Arial"      2   Mean = 19.64

Figure 5: Using the ANNOTATE Option to Label the Plot Points


4. PLOTTING CONFIDENCE LIMITS USING PROC SGPLOT

It is much easier to create a group confidence limit plot using the SGPLOT procedure, which is new in SAS 9.2.

4.1 USING THE VLINE STATEMENT

If the response variable is assumed to follow the normal distribution, then you can create the plot directly from the raw data set, Results, using a VLINE statement.

    ODS GRAPHICS ON / RESET IMAGENAME="S_1";
    TITLE "Resistance";
    PROC SGPLOT DATA=Results;
       VLINE Month / RESPONSE=Resistance STAT=Mean LIMITS=BOTH LIMITSTAT=CLM
                     ALPHA=0.10 MARKERS DATALABEL;
       YAXIS VALUES=(18 TO 22 BY 1);
    RUN;

The VLINE statement draws a line plot of Resistance vs. Month. The STAT=MEAN option specifies that the mean resistance be plotted. The LIMITS=BOTH option specifies both upper and lower limits. The LIMITSTAT=CLM option specifies normal confidence limits on the mean, and the ALPHA=.10 option specifies 90% limits. The DATALABEL option causes the points to be labeled with the values of the mean resistance.

Figure 6: Confidence Limit Plot Using the VLINE Statement of PROC SGPLOT

The previous example required PROC MEANS, PROC SQL, two PROC SORT steps and three DATA steps before PROC GPLOT could create essentially the same plot (shown in Figure 5), so you can see that PROC SGPLOT gives the same result with far less effort.


4.2 USING THE SCATTER AND SERIES STATEMENTS

The VLINE statement does not have options for creating Poisson or binomial confidence limits, but you can compute the limits using PROC MEANS and a DATA step, as shown in Section 2, and then create the required plot using the SCATTER and SERIES statements. The example below uses the Resistance variable, but the same code would work for any distribution as long as the Mean, LCL and UCL variables are stored in the Tab data set.

    ODS GRAPHICS ON / RESET IMAGENAME="S_2";
    PROC SGPLOT DATA=Tab;
       SCATTER Y=Mean X=Month / YERRORLOWER=LCL YERRORUPPER=UCL
               LEGENDLABEL="Resistance(Mean), 90% Confidence Limits";
       SERIES Y=Mean X=Month / DATALABEL=Mean LEGENDLABEL=" ";
       YAXIS VALUES=(18 TO 22 BY 1);
    RUN;

The SCATTER statement creates a scatter plot of Mean vs. Month. The YERRORLOWER and YERRORUPPER options draw limits at LCL and UCL, the 90% confidence limits on the mean. The LEGENDLABEL option creates a legend similar to that in Figure 6. The SERIES statement joins the Mean values, and the DATALABEL option labels each point with the value of Mean.

Figure 7: Confidence Limit Plot Using the SCATTER and SERIES Statements of PROC SGPLOT

This method requires using PROC MEANS and one DATA step before using PROC SGPLOT, but is still much easier than the method based on PROC GPLOT. Note also that the mean values are automatically positioned either before or after, and above or below, the plot point, in order to avoid overwriting other elements of the plot area. It would require a lot more work to get the same placement using the ANNOTATE option of PROC GPLOT.
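To illustrate the point that the same plotting code serves the other distributions, the sketch below plots the Poisson limits for the Defects variable. It assumes that Tab has just been rebuilt with the PROC MEANS and DATA steps from the Poisson part of Section 2, so that Mean, LCL and UCL refer to defects per unit. The image name S_3, the title, the legend text and the axis label are illustrative choices made here.

```sas
ODS GRAPHICS ON / RESET IMAGENAME="S_3";
TITLE "Defects per Unit";
PROC SGPLOT DATA=Tab;
   /* Markers at the monthly means, with error bars from LCL to UCL. */
   SCATTER Y=Mean X=Month / YERRORLOWER=LCL YERRORUPPER=UCL
           LEGENDLABEL="Defects(Mean), 90% Confidence Limits";
   /* Join the means and label each point with its value. */
   SERIES Y=Mean X=Month / DATALABEL=Mean LEGENDLABEL=" ";
   /* Start the axis at zero, since a defect rate cannot be negative. */
   YAXIS MIN=0 LABEL="Mean Defects per Unit";
RUN;
```

Only the axis range and the labels differ from the Resistance example. For the Fail variable, the same statements apply after rebuilding Tab with the binomial steps.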


5. CONCLUSION

Group confidence limit plots provide a very effective method for understanding the significance of group differences, and presenting these differences to others. You can use the methods shown in this paper to create confidence limit plots for variables assumed to have the normal, Poisson or binomial distribution. The plots can be created using PROC GPLOT or, with much less effort, using PROC SGPLOT.

The paper by Schenker and Gentleman contains an in-depth discussion of the use of non-overlapping confidence intervals to test the significance of differences. They point out that the method is conservative compared to optimal significance tests, but agree that it is useful as a quick and convenient method for exploratory data analysis.

The methods shown here are adapted from material in Chapters 5, 6 and 8 of Just Enough SAS® : A Quick-Start Guide to SAS® for Engineers.

REFERENCES

Carpenter, Art. 1999. Annotate: Simply the Basics. Cary, NC: SAS Institute Inc., 94 pp.

Carpenter, Arthur L. 2006. “Data Driven Annotations: An Introduction to SAS/GRAPH’s® Annotate Facility.” Proceedings of the Thirty-First Annual SAS Users Group International Conference. Paper 108-31.

Schenker, Nathaniel and Gentleman, Jane F. 2001. “On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals.” The American Statistician, August 2001, Vol. 55, No. 3. pp. 182-186.

Rutledge, Robert A. 2009. Just Enough SAS® : A Quick-Start Guide to SAS® for Engineers. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Robert A. Rutledge
Sun Microsystems
Work Phone: (408) 404-4321
E-mail: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
