
Computational Statistics & Data Analysis 32 (2000) 239–243
www.elsevier.com/locate/csda

On the eve of the 21st century: statistical science at a crossroads

Edward J. Wegman

Center for Computational Statistics, George Mason University, MS 4A7, Fairfax, VA 22030-4444, USA

1. Introduction

Modern statistical science is one of the major scientific achievements of the 20th century. While many of the concepts had been laid down earlier, it was not until the 20th century that a true science of statistics came into being. It has been my privilege over the course of my professional life to have known many of the seminal figures in the history of modern 20th-century statistics. Regrettably, I did not have a chance to know Karl Pearson and R.A. Fisher, but I did know many of this century's greatest contributors to statistical theory, such as Egon Pearson, Jerzy Neyman, David Blackwell, C.R. Rao, Harold Hotelling, R.C. Bose, Wassily Hoeffding, Gertrude Cox, Jimmie Savage, and Harald Cramér, to name a few. Of course, many great contributors are still active today. I often contemplate what these earlier innovators of the 20th century might think about the state of statistical science as we approach the 21st century. The computing revolution of the last 20 years has thrust upon us great changes in the way we collect data and in our general attitude towards both data and methodology. In a real sense, statistical science is at a crossroads. Statistics as a discipline, it seems to me, is all too often defined by its set of techniques, tools and methodology rather than by the goal of data analysis and inference. Defining the science by its techniques, tools and methods is a prescription for insularity, while defining it by its goals is a prescription for openness and expansion. I rather think the great innovators of statistical methodology mentioned above would opt for a more goal-oriented worldview of statistical science than seems to prevail today.

E-mail address: [email protected] (E.J. Wegman)

0167-9473/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.
PII: S0167-9473(99)00078-X


This short essay is intended to set the framework for this special issue of Computational Statistics & Data Analysis. A fundamental premise of the special issue, and of the IASC World Congress from which it arises, is that as we move into the 21st century, great new challenges for statisticians and data analysts lie ahead of us. The special issue is intended to capture some of these challenges. As I see it, data acquisition and data types are changing, and data analysis methodology must correspondingly change.

2. How data are changing

My fundamental premise is that new data types will give rise to new methodology. The commonly assumed data structures that have given rise to traditional mathematical statistics are i.i.d. or homogeneous, low-dimensional, relatively small-sample structures arising from designed experiments. I have argued elsewhere, Wegman (1988a,b), that the character of data being collected is changing dramatically and that this has profound implications for the types of analysis we can do. This difference gives rise to what I then called computational statistics.

The computational and inferential implications of larger and higher-dimensional data sets became clearer to me when I first heard of Peter Huber's taxonomy of data set sizes, Huber (1992, 1994), to which I have added in a modest fashion, Wegman (1995). It has been my thesis that, although statistical theory often takes the form of asymptotics, when statisticians think of data they in fact tend to think of fairly small data sizes, say 100 to perhaps 10,000 observations. These are what the Huber taxonomy characterizes as tiny and small data sets. In Wegman (1995), I argue that conventional statistical methodologies begin to break down around 10^6 observations. By 10^8 observations, interactive techniques become infeasible for all but the simplest algorithms. Also in this range of data set sizes, the feasibility of transferring data over common networks, and even of storing data in RAM, becomes marginal. I also argue that beyond 10^6 observations the human capacity for visualization becomes problematic. Even supercomputers or teraflop-capable next-generation computers offer little relief, especially with algorithms of complexity beyond O(n^{3/2}).
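For readers unfamiliar with the taxonomy, a rough sketch follows. The category labels and sizes are reconstructed from my memory of Wegman (1995), not from this essay, so treat the exact figures as an assumption; Huber's scale is usually quoted in bytes, while the essay above speaks loosely in observations.

```python
# Sketch of the Huber/Wegman taxonomy of data set sizes.
# Byte counts reconstructed from Wegman (1995); treat as approximate.
# The essay itself names only the "tiny", "small", and "massive" categories.
TAXONOMY = {
    "tiny":    10**2,
    "small":   10**4,
    "medium":  10**6,   # around where conventional methods begin to break down
    "large":   10**8,   # interactive techniques become infeasible
    "huge":    10**10,
    "massive": 10**12,  # the size used in the illustrations below
}

for label, size in TAXONOMY.items():
    print(f"{label:>8s}: 10^{len(str(size)) - 1} bytes")
```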

Some simple numerical illustrations are actually quite compelling. Consider, for example, an O(n^2) clustering algorithm applied to a massive data set of 10^12 observations. This would require O(10^24) computations, which on a teraflop computer (10^12 computations per second) would take 10^12 s, or roughly 30,000 years. Clearly this is prohibitive. Standard Ethernet operates at a maximum of 10 megabits per second. That same massive data set would require on the order of 10^6 s, roughly ten days, to transfer over standard Ethernet operating at maximal efficiency. The human eye contains approximately 10^7 cones. Even with the visualization capability of one observation per cone, our eyes would be hopelessly overloaded: a massive data set would require us to visualize 10^5 observations per cone.
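The arithmetic behind these illustrations is easy to reproduce. A minimal sketch follows; the one-byte-per-observation figure for the transfer estimate and the seconds-per-year constant are my assumptions, not the essay's.

```python
# Back-of-envelope arithmetic for the feasibility estimates above.

n = 10**12                 # observations in a "massive" data set
flops = 10**12             # teraflop computer: operations per second
sec_per_year = 3.15e7      # approximate seconds in a year (assumption)

# O(n^2) clustering: about 10^24 operations on 10^12 observations.
ops = n**2
years = ops / flops / sec_per_year
print(f"clustering: {ops / flops:.1e} s, about {years:.1e} years")  # ~3e4 years

# Transfer over 10 Mbit/s Ethernet, assuming one byte per observation.
seconds = n * 8 / 10e6
print(f"transfer:   {seconds:.1e} s, about {seconds / 86400:.0f} days")

# Visualization: the human eye has roughly 10^7 cones.
print(f"viewing:    {n / 10**7:.0e} observations per cone")
```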


So if gigabyte and larger data sets and algorithms of O(n^{3/2}) complexity are problems, to what extent do these factors appear in real data? The answer is that they appear fairly commonly. More and more, data are being collected for one purpose, and attempts are then made to exploit the data for other purposes. Airline booking transactions, point-of-sale commercial purchases, bank transactions, and telephone call records are just a few of the commercial databases one might wish to exploit. In the scientific realm, catalogs of celestial objects, data from satellite remote sensing of the Earth, exploitation of text and multimedia data from internet usage, ultrasound nondestructive evaluation data, radar data used in air traffic control, and image understanding and exploitation are just a few examples for which data quickly accumulate into the terabyte and higher range.

Moreover, data structures are becoming more complex, not only by being multivariate, but also by being categorical or of mixed data types, by being incomplete, by being heavily quantized (and so subject to roundoff), or by being non-numerical, as with, for example, the human genome data. Algorithms that work well for tiny or small data sets become infeasible with more massive data. Clustering algorithms are usually of O(n^2) complexity, while robust multivariate location and scale algorithms such as MVE or MCD are of exponential complexity. Thus there are real data sets and real algorithms that challenge the limits of computation, of data transfer, and of visualization. These are not mere fictions but challenges that statisticians must address. It is clear that the changing nature of data challenges us in ways that traditional statistical analysis does not even consider.
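To make the infeasibility concrete, consider just the storage cost of the pairwise-distance matrix that many O(n^2) clustering algorithms maintain. The sketch below is my own illustration (8-byte floats, upper triangle only), not a computation from the essay:

```python
# Storage for the full pairwise-distance matrix used by many O(n^2)
# clustering algorithms: n*(n-1)/2 distances as 8-byte floats.

def distance_matrix_gib(n: int) -> float:
    """GiB needed to hold the upper triangle of an n x n distance matrix."""
    return 8 * n * (n - 1) / 2 / 2**30

for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {n:>9,d}: {distance_matrix_gib(n):10.2f} GiB")
# Already at n = 10^6 the matrix needs ~3,700 GiB, far beyond the RAM
# of any late-1990s machine, regardless of processor speed.
```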

3. Whither statistics?

When I was a youngster in the statistics profession, I often heard the elders of our profession espouse the philosophical perspective that statistics is the guardian of the scientific method. This position arises from a paradigm in which the model of scientific behavior is to study a phenomenon, make a model of the phenomenon, collect data about the phenomenon, test the model using the data, refine the model, and retest it, repeating the process until the model is adequate. This paradigm is of course captured pretty well by classical statistical hypothesis testing, hence the leap to call statistical science the guardian of the scientific method. This view of statistics is still held by many with some sense of lofty idealism.

The trouble with this position is that, as often as not, real scientists do not actually operate with this so-called "scientific method". Moreover, this view of statistical science, from my perspective, does not capture the essence of what our discipline should be about. The focus of our discipline, I believe, should be on data and the inferences to be made from data. If the nature of data is changing, then the methods for analyzing and making inferences from those data must correspondingly change. I repeat that I do not believe by any stretch of the imagination that all traditional methods of statistics should be abandoned. Quite the contrary, many traditional methods are extremely valuable and will continue to be employed for the foreseeable future. However, new data types require new methods and techniques.

So how will statistics as a methodology and a discipline evolve? First, let me comment that I believe many of the traditional dichotomies will become anachronisms.


The Bayesian versus classical perspective will essentially disappear. Both of these approaches tend to refer to parametric techniques, and parametric techniques are poor at coping with really large-scale data. Similarly, nonparametric versus parametric techniques still refer to model-based views of data. For many purposes, models are unnecessary if the data speak in a sufficiently compelling fashion. If the data are not collected according to probabilistic sampling, then both parametric and nonparametric statistical models are essentially irrelevant, except as heuristic tools.

I hope statistics as a discipline will embrace a larger view of the field and will take data, rather than methodology, to be the fundamental common denominator of the discipline. With this view, not only traditional statistics and probability are the focus of the discipline, but topics like data mining, image analysis, pattern recognition, databases, and related computational methods also become fundamental features of the discipline. It should be clear that I favor a view of statistical science which is open and inclusive. Such a view implies that statisticians must embrace a wider set of techniques, tools and methods than is common today. I fear that if we do not do so, statistics, like classical Newtonian mechanics, will become a useful practical tool but no longer a venue for exciting research and innovation.

4. The special issue

The Second World Congress on Computational Statistics and Data Analysis was held in Pasadena, California on February 19 through 22, 1997. The theme was Computational Statistics and Data Analysis ... on the Eve of the 21st Century. It is entirely fitting that this special issue appears just as the 21st century begins. The Congress itself was sponsored by the International Association for Statistical Computing and was organized with a distinctly international flavor. Invited speakers, contributed speakers and participants came from the Americas, Europe, Australia, and Asia. Themes ranged widely, from biomedical and pharmacological applications, computational databases, education using the world wide web, and graphical methods to traditional statistical modeling and a host of other topics apropos to modern statistics.

Proceedings (Wegman and Azen, 1997) were produced and have been available for some time. However, the intent of the special issue was to highlight material of special interest. An attempt was made to balance regional participation, topical areas, traditional modeling versus more data-analytic approaches, invited versus contributed papers, methodological versus applied papers, and a host of other competing factors. In the end, many practicalities temper the construction of a special issue, not the least of which is the space allotted for it in the journal. Of course, quality of presentation was a major concern in developing the special issue, as was the responsiveness of the potential contributors. There were many excellent candidates for inclusion in this special issue, and the absence of a paper should not be interpreted as pejorative to the quality of the submission in any sense. Indeed, many of these papers will appear in the regular refereed literature. The special issue is to a large extent a personal selection by this editor of what he found to be of interest, and it represents an eclectic mix of papers which I hope reflects the diversity of statistical and data-analytic interests on the eve of the 21st century.


Acknowledgements

My deep gratitude goes to my friend, Professor Stanley Azen, my co-chair of the Scientific Program Committee and chair of the Local Organizing Committee. Other members of the Local Organizing Committee included Ernest Scheuer, A. A. Afifi, Joyce Niland, John Rolph and Simon Tavaré. Other members of my Scientific Program Committee included Murray Cameron, David J. Hand, Haruo H. Onishi and Gilbert Saporta. I am grateful to the authors in this special issue, who have patiently awaited referees' reports and who have responded with humor and grace to the tedious revision process. My editorial work and the preparation of this essay were supported by the Army Research Office under Grant DAAG55-98-1-0404, by the Office of Naval Research under Grant DAAD19-99-1-0314 administered by the Army Research Office, by the National Science Foundation under Group Infrastructure Grant DMS-9631351, and by the Defense Advanced Research Projects Agency under Agreement 8905-48174 with The Johns Hopkins University.

References

Huber, P.J., 1992. Issues in computational data analysis. In: Dodge, Y., Whittaker, J. (Eds.), Computational Statistics, Vol. 2. Physica-Verlag, Heidelberg.

Huber, P.J., 1994. Huge data sets. In: Dutter, R., Grossmann, W. (Eds.), Compstat 1994: Proceedings. Physica-Verlag, Heidelberg.

Wegman, E.J., 1988a. A view of computational statistics and its curriculum. American Statistical Association Proceedings of the Section on Statistical Education, pp. 1–6.

Wegman, E.J., 1988b. Computational statistics: a new agenda for statistical theory and practice. J. Washington Acad. Sci. 78, 310–322.

Wegman, E.J., 1995. Huge data sets and the frontiers of computational feasibility. J. Comput. Graphical Statist. 4 (4), 281–295.

Wegman, E.J., Azen, S. (Eds.), 1997. Computing Science and Statistics: Proceedings of the Second World Congress of the IASC. Interface Foundation of North America, Fairfax Station, VA.