BENCHMARKS Ramon Zatarain


  • Slide 1
  • BENCHMARKS Ramon Zatarain
  • Slide 2
  • INDEX: Benchmarks and Benchmarking; Relation of Benchmarks with Empirical Methods; Benchmark definition; Types of benchmarks; Benchmark suites; Measuring performance (CPU, comparing performance, etc.); Common system benchmarks; Examples of software benchmarks; Benchmark pitfalls; Recommendations; Benchmarking rules; Bibliography
  • Slide 3
  • Benchmarks and Benchmarking A benchmark was a reference point for determining one's current position or altitude in topographical surveys and tidal observations. A benchmark was a standard against which others could be measured.
  • Slide 4
  • Benchmarks and Benchmarking In the 1970s, the concept of a benchmark evolved beyond a technical term signifying a reference point. The word migrated into the lexicon of business, where it came to signify the measurement process by which to conduct comparisons.
  • Slide 5
  • Benchmarks and Benchmarking In the early 1980s, Xerox Corporation, a leader in benchmarking, defined it as the continuous process of measuring products, services, and practices against the toughest competitors.
  • Slide 6
  • Benchmarks and Benchmarking Benchmarks, in contrast to benchmarking, are measurements used to evaluate the performance of a function, operation, or business relative to others. In the electronics industry, for instance, a benchmark has long referred to an operating statistic that allows you to compare your own performance to that of another.
  • Slide 7
  • RELATION OF BENCHMARKS WITH EMPIRICAL METHODS In many areas of computer science, experiments are the primary means of demonstrating the potential and value of systems and techniques. Empirical methods for analysing and comparing systems and techniques are therefore of considerable interest to many CS researchers.
  • Slide 8
  • RELATION OF BENCHMARKS WITH EMPIRICAL METHODS The main evaluation criterion that has been adopted in some fields, such as satisfiability testing (SAT), is empirical performance on shared benchmark problems. In the seminar Future Directions in Software Engineering, many issues were addressed; some of them were:
  • Slide 9
  • RELATION OF BENCHMARKS WITH EMPIRICAL METHODS In the paper Research Methodology in Software Engineering, four methodologies were identified: the scientific method, the engineering method, the empirical method, and the analytical method. In the paper We Need To Measure The Quality Of Our Work, the author points out that we as a community have no generally accepted methods or benchmarks for measuring and comparing the quality and utility of our research results.
  • Slide 10
  • Examples: IEEE Computer Society Workshop on Empirical Evaluation of Computer Vision Algorithms; A benchmark for graphics recognition systems; An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl.
  • Slide 11
  • BENCHMARK DEFINITION Some definitions are: a test that measures the performance of a system or subsystem on a well-defined task or set of tasks; a method of comparing the performance of different computer architectures; or a method of comparing the performance of different software.
  • Slide 12
  • TYPES OF BENCHMARKS Real programs. They have input, output, and options that a user can select when running the program. Examples: compilers, text-processing software, etc. Kernels. Small, key pieces extracted from real programs; they are not run by end users on their own. Examples: Livermore Loops and Linpack.
  • Slide 13
  • TYPES OF BENCHMARKS Toy benchmarks. Typically between 10 and 100 lines of code; they produce a result the user already knows in advance. Examples: Sieve of Eratosthenes, Puzzle, and Quicksort. Synthetic benchmarks. They try to match an average execution profile. Examples: Whetstone and Dhrystone.
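  • A toy benchmark of this kind fits in a few lines. Below is a minimal C sketch of the Sieve of Eratosthenes; the bound of 8191 is arbitrary and only illustrative:

        #include <stdio.h>
        #include <string.h>

        #define LIMIT 8191                            /* arbitrary small bound */

        int main(void)
        {
            static char is_prime[LIMIT + 1];
            int count = 0;

            memset(is_prime, 1, sizeof is_prime);
            for (int i = 2; i <= LIMIT; i++) {
                if (is_prime[i]) {
                    count++;                          /* i is prime */
                    for (int k = i + i; k <= LIMIT; k += i)
                        is_prime[k] = 0;              /* mark multiples of i */
                }
            }
            printf("%d primes up to %d\n", count, LIMIT);
            return 0;
        }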
  • Slide 14
  • BENCHMARK SUITES A benchmark suite is a collection of benchmarks that tries to measure the performance of processors across a variety of applications. The advantage is that the weakness of any one benchmark is lessened by the presence of the others. Some benchmarks in a suite are kernels, but many are real programs.
  • Slide 15
  • BENCHMARK SUITES Example: SPEC92 benchmark suite (20 programs):

        Benchmark   Source    Lines of code   Description
        Espresso    C         13,500          Minimize Boolean functions
        Li          C         7,413           Lisp interpreter (9 queens problem)
        Eqntott     C         3,376           Translate Boolean equations
        Compress    C         1,503           Data compression
        Sc          C         8,116           Computation in a spreadsheet
        Gcc         C         83,589          GNU C compiler
        Spice2g6    Fortran   18,476          Circuit simulation package
        Doduc       Fortran   5,334           Simulation of a nuclear reactor
        Mdljdp2     Fortran   4,458           Chemical application
        Wave5       Fortran   7,628           Electromagnetic simulation
        Tomcatv     Fortran   195             Mesh generation program
        Ora         Fortran   535             Traces rays through optical systems
        Alvinn      C         272             Simulation in neural networks
        Ear         C         4,483           Inner ear model
  • Slide 16
  • MEASURING PERFORMANCE Wall-clock time (elapsed time): latency to complete a task, including disk accesses, input/output activity, memory accesses, and OS overhead. CPU time: does not include time spent waiting for I/O or running other programs. User CPU time: time spent in the program. System CPU time: time spent in the OS.
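  • A hedged sketch of these distinctions on a UNIX-like system: times() reports user and system CPU time separately, while gettimeofday() gives wall-clock (elapsed) time. The workload function here is only a stand-in:

        #include <stdio.h>
        #include <sys/time.h>
        #include <sys/times.h>
        #include <unistd.h>

        static void workload(void)                    /* stand-in for the benchmarked code */
        {
            volatile double x = 0.0;
            for (long i = 1; i <= 50000000L; i++)
                x += 1.0 / (double)i;
        }

        int main(void)
        {
            struct timeval w0, w1;
            struct tms c0, c1;
            long ticks = sysconf(_SC_CLK_TCK);        /* clock ticks per second */

            gettimeofday(&w0, NULL);
            times(&c0);
            workload();
            times(&c1);
            gettimeofday(&w1, NULL);

            double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_usec - w0.tv_usec) / 1e6;
            double user = (double)(c1.tms_utime - c0.tms_utime) / ticks;
            double sys  = (double)(c1.tms_stime - c0.tms_stime) / ticks;

            printf("wall-clock %.3f s, user CPU %.3f s, system CPU %.3f s\n",
                   wall, user, sys);
            return 0;
        }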
  • Slide 17
  • CPU Performance Measures MIPS (millions of instructions per second): how fast the machine can execute instructions. MFLOPS (millions of floating-point operations per second); GFLOPS (gigaflops). Other measures are Whets (Whetstone benchmark), VUP (VAX unit of performance), and SPECmarks. Note: MIPS is sometimes said to stand for "meaningless indicator of performance for salesmen".
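  • A minimal sketch of how such figures are derived; the instruction and operation counts below are made-up illustrative values, not measurements of any real machine:

        #include <stdio.h>

        int main(void)
        {
            double instructions = 250e6;   /* executed instructions (assumed) */
            double flops        = 40e6;    /* floating-point operations (assumed) */
            double seconds      = 2.0;     /* measured execution time (assumed) */

            printf("MIPS   = %.1f\n", instructions / (seconds * 1e6));   /* 125.0 */
            printf("MFLOPS = %.1f\n", flops / (seconds * 1e6));          /*  20.0 */
            return 0;
        }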
  • Slide 18
  • COMPARING PERFORMANCE Execution times of three programs on three machines:

                              Computer A   Computer B   Computer C
        Program P1 (secs)              1           10           20
        Program P2 (secs)          1,000          100           20
        Total time (secs)          1,001          110           40
  • Slide 19
  • CPU Performance Measures TOTAL EXECUTION TIME. An average of the execution times that tracks total execution time is the arithmetic mean:

        Arithmetic mean = (1/n) * sum(i = 1..n) Time_i

    where Time_i is the execution time of the ith program of a total of n in the workload. When performance is expressed as a rate, we use the harmonic mean:

        Harmonic mean = n / sum(i = 1..n) (1 / Rate_i)

    where Rate_i is a function of 1/Time_i, the execution time for the ith of n programs in the workload. The harmonic mean is used when performance is measured in MIPS or MFLOPS.
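  • A small C sketch of both means, using the P1/P2 execution times from the table on the previous slide (rates are taken here simply as 1/time, i.e. programs per second):

        #include <stdio.h>

        #define N 2   /* programs in the workload */

        static double arithmetic_mean(const double t[], int n)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += t[i];
            return sum / n;
        }

        static double harmonic_mean(const double rate[], int n)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += 1.0 / rate[i];
            return n / sum;
        }

        int main(void)
        {
            double a[N] = { 1.0, 1000.0 };      /* computer A: P1, P2 times (secs) */
            double b[N] = { 10.0, 100.0 };      /* computer B */
            double c[N] = { 20.0, 20.0 };       /* computer C */
            double rate_a[N] = { 1.0 / a[0], 1.0 / a[1] };

            printf("Arithmetic mean of times: A %.2f  B %.2f  C %.2f\n",
                   arithmetic_mean(a, N), arithmetic_mean(b, N), arithmetic_mean(c, N));
            printf("Harmonic mean of rates, computer A: %.6f programs/sec\n",
                   harmonic_mean(rate_a, N));
            return 0;
        }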
  • Slide 20
  • CPU Performance Measures WEIGHTED EXECUTION TIME. A question arises: what is the proper mixture of programs for the workload? In the arithmetic mean we assume that programs P1 and P2 are run equally often in the workload. A weighted arithmetic mean is given by:

        Weighted mean = sum(i = 1..n) Weight_i * Time_i

    where Weight_i is the frequency of the ith program in the workload and Time_i is the execution time of program i.
  • Slide 21
  • CPU Performance Measures Weighted arithmetic mean execution times using three weightings:

                                   Comp A    Comp B    Comp C     W(1)     W(2)     W(3)
        Program P1 (secs)               1        10        20     0.50    0.909    0.999
        Program P2 (secs)           1,000       100        20     0.50    0.091    0.001
        Arithmetic mean: W(1)      500.50      55.0      20.0
        Arithmetic mean: W(2)       91.91     18.19      20.0
        Arithmetic mean: W(3)         2.0     10.09      20.0
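  • The weighted means in the table can be reproduced with a few lines of C; this sketch recomputes them from the raw times and the three weightings:

        #include <stdio.h>

        int main(void)
        {
            const char *machine[3] = { "A", "B", "C" };
            double time[3][2]   = { { 1.0, 1000.0 },      /* computer A: P1, P2 (secs) */
                                    { 10.0, 100.0 },      /* computer B */
                                    { 20.0, 20.0 } };     /* computer C */
            double weight[3][2] = { { 0.500, 0.500 },     /* weighting W(1) */
                                    { 0.909, 0.091 },     /* weighting W(2) */
                                    { 0.999, 0.001 } };   /* weighting W(3) */

            for (int w = 0; w < 3; w++) {
                printf("W(%d):", w + 1);
                for (int m = 0; m < 3; m++) {
                    double mean = weight[w][0] * time[m][0] + weight[w][1] * time[m][1];
                    printf("  %s = %.2f", machine[m], mean);
                }
                printf("\n");
            }
            return 0;
        }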
  • Slide 22
  • COMMON SYSTEM BENCHMARKS 007 (OODBMS). Designed to simulate a CAD/CAM environment.
    Tests:
    - Pointer traversals over cached data; disk-resident data; sparse traversals; and dense traversals
    - Updates: indexed and unindexed object fields; repeated updates; sparse updates; updates of cached data; and creation and deletion of objects
    - Queries: exact match lookup; ranges; collection scan; path-join; ad-hoc join; and single-level make
    Originator: University of Wisconsin
    Versions: Unknown
    Availability of Source: Free from ftp.cs.wisc.edu:/007
    Availability of Results: Free from ftp.cs.wisc.edu:/007
    Entry Last Updated: Thursday April 15 15:08:07 1993
  • Slide 23
  • AIM AIM Technology, Palo Alto. Two suites (III and V).
    Suite III: simulation of applications (task- or device-specific)
    - Task-specific routines (word processing, database management, accounting)
    - Device-specific routines (memory, disk, MFLOPs, IOs)
    - All measurements represent a percentage of VAX 11/780 performance (100%)
    In general, Suite III gives an overall indication of performance.
    Suite V: measures throughput in a multitasking workstation environment by testing:
    - Incremental system loading
    - Multiple aspects of system performance
    The graphically displayed results plot the workload level versus time. Several different models characterize various user environments (financial, publishing, software engineering). The published reports are copyrighted.
    An example of AIM benchmark results (in .pdf format)
  • Slide 24
  • Dhrystone Short synthetic benchmark program intended to be representative of system (integer) programming. Based on published statistics on the use of programming language features; see the original publication in CACM 27,10 (Oct. 1984), 1013-1030. Originally published in Ada, now mostly used in C. Version 2 (in C) was published in SIGPLAN Notices 23,8 (Aug. 1988), 49-62, together with measurement rules. Version 1 is no longer recommended since state-of-the-art compilers can eliminate too much "dead code" from the benchmark (however, quoted MIPS numbers are often based on Version 1).
    Problems: Due to its small size (100 HLL statements, 1-1.5 KB code), the memory system outside the cache is not tested; compilers can too easily optimize for Dhrystone; and string operations are somewhat over-represented.
    Recommendation: Use it for controlled experiments only; don't blindly trust single Dhrystone MIPS numbers quoted somewhere (as a rule, don't do this for any benchmark).
    Originator: Reinhold Weicker, Siemens Nixdorf ([email protected])
    Versions in C: 1.0, 1.1, 2.0, 2.1 (final version, minor corrections compared with 2.0)
    See also: R.P. Weicker, A Detailed Look... (see Publications, 4.3)
    Availability of source: [email protected], ftp.nosc.mil:pub/aburto
    Availability of results (no guarantee of correctness): same as above
  • Slide 25
  • Khornerstone Multipurpose benchmark used in various periodicals.
    Originator: Workstation Labs
    Versions: unknown
    Availability of Source: not free
    Availability of Results: UNIX Review
    LINPACK Kernel benchmark developed from the "LINPACK" package of linear algebra routines. Originally written and commonly used in FORTRAN; a C version also exists. Almost all of the benchmark's time is spent in a subroutine ("saxpy" in the single-precision version, "daxpy" in the double-precision version) doing the inner loop for frequent matrix operations:
        y(i) = y(i) + a * x(i)
    The standard version operates on 100x100 matrices; there are also versions for sizes 300x300 and 1000x1000, with different optimization rules.
    Problems: Code is representative only for this type of computation. LINPACK is easily vectorizable on most systems.
    Originator: Jack Dongarra, Computer Science Department, University of Tennessee, [email protected]
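  • The inner loop quoted above is small enough to show; a C rendering of the double-precision daxpy kernel might look like this:

        /* y(i) = y(i) + a * x(i): the loop in which LINPACK spends
           almost all of its time */
        void daxpy(int n, double a, const double *x, double *y)
        {
            for (int i = 0; i < n; i++)
                y[i] = y[i] + a * x[i];
        }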
  • Slide 26
  • MUSBUS Designed by Ken J. McDonell at Monash University in Australia, it is a very good benchmark of disk throughput and multi-user simulation. It compiles, creates the directories and the workload for the simulated users, and executes the simulation three times, measuring CPU and elapsed time. The workload consists of 11 commands (cc, rm, ed, ls, cp, spell, cat, mkdir, export, chmod, and an nroff-like spooler) and 5 programs (syscall, randmem, hanoi, pipe, and fstime). This is a very complete test and a significant measurement of CPU speed, C compiler and UNIX quality, file system performance and multi-user capabilities, disk throughput, and memory management implementation.
  • Slide 27
  • Nhfsstone A benchmark intended to measure the performance of file servers that follow the NFS protocol. The work in this area continued within the LADDIS group and finally within SPEC. The SPEC benchmark 097.LADDIS is intended to replace Nhfsstone. It is superior to Nhfsstone in several aspects (multi-client capability, less client sensitivity).
  • Slide 28
  • SPEC SPEC stands for Standard Performance Evaluation Corporation, a non-profit organization whose goal is to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high performance computers" (from SPEC's bylaws). The SPEC benchmarks and more information can be obtained from:
    SPEC [Standard Performance Evaluation Corporation]
    c/o NCGA [National Computer Graphics Association]
    2722 Merrilee Drive, Suite 200
    Fairfax, VA 22031, USA
    Phone: +1-703-698-9600 Ext. 325
    FAX: +1-703-560-2752
    E-Mail: [email protected]
    The current SPEC benchmark suites are: CINT92 (CPU-intensive integer benchmarks), CFP92 (CPU-intensive floating-point benchmarks), SDM (UNIX software development workloads), and SFS (system-level file server (NFS) workload).
  • Slide 29
  • SSBA The SSBA is the result of the studies of the AFUU (French Association of UNIX Users) Benchmark Working Group. This group, consisting of some 30 active members of varied origins (universities, public and private research, manufacturers, end users), has assigned itself the task of assessing the performance of data processing systems, collecting a maximum number of tests available throughout the world, dissecting the codes and results, discussing the utility, fixing versions, and supplying them with various comments and procedures. A sample output of the SSBA suite of UNIX benchmark tests
  • Slide 30
  • Sieve of Eratosthenes An integer program that generates prime numbers using a method known as the Sieve of Eratosthenes.
    TPC TPC-A is a standardization of the Debit/Credit benchmark, which was first published in DATAMATION in 1985. It is based on a single, simple, update-intensive transaction which performs three updates and one insert across four tables. Transactions originate from terminals, with a requirement of 100 bytes in and 200 bytes out. There is a fixed scaling between tps rate, terminals, and database size. TPC-A requires an external RTE (remote terminal emulator) to drive the SUT (system under test). TPC-C, the later order-entry benchmark, performs five kinds of transactions: entering a new order, delivering orders, posting customer payments, retrieving a customer's most recent order, and monitoring the inventory level of recently ordered items.
  • Slide 31
  • Whetstone The first major synthetic benchmark program, intended to be representative of numerical (floating-point intensive) programming. Based on statistics gathered at the National Physical Laboratory in England, using an Algol 60 compiler which translated Algol into instructions for the imaginary Whetstone machine. The compilation system was named after the small town outside the City of Leicester, England, where it was designed (Whetstone).
    Problems: Due to the small size of its modules, the memory system outside the cache is not tested; compilers can too easily optimize for Whetstone; mathematical library functions are over-represented.
    Originator: Brian Wichmann, NPL
  • Slide 32
  • Whetstone One of the first and most popular benchmarks, Whetstone was originally published in 1976 by Curnow and Wichmann in Algol and subsequently translated into FORTRAN. This synthetic mix of elementary Whetstone instructions is modeled with statistics from about 1000 scientific and engineering applications. Whetstone is rather small and, due to its straightforward coding, may be prone to particular (and unintentional) treatment by intelligent compilers. It is very sensitive to the processing of transcendental and trigonometric functions, and strongly dependent on a fast or additional mathematics coprocessor. Whetstone is a good predictor for engineering and scientific applications.
  • Slide 33
  • SYSmark SYSmark93 provides benchmarks that can be used to measure performance of IBM PC-compatible hardware for the tasks users perform on a regular basis. SYSmark93 benchmarks represent the workloads of popular programs in such applications as word processing, spreadsheets, database, desktop graphics, and software development.
  • Slide 34
  • Stanford A collection of C routines developed in 1988 at Stanford University (J. Hennessy, P. Nye). Its two modules, Stanford Integer and Stanford Floating Point, provide a baseline for comparisons between Reduced Instruction Set (RISC) and Complex Instruction Set (CISC) processor architectures.
    Stanford Integer: eight applications (integer matrix multiplication, sorting algorithms [quick, bubble, tree], permutation, hanoi, 8 queens puzzle)
    Stanford Floating Point: two applications (Fast Fourier Transform [FFT] and matrix multiplication)
    The characteristics of the programs vary, but most of them have array accesses. There seems to be no official publication (only a printing in a performance report), and there is no defined weighting of the results (Sun and MIPS compute the geometric mean).
  • Slide 35
  • Bonnie This is a file system benchmark that attempts to study bottlenecks. Specifically, these are the types of filesystem activity that have been observed to be bottlenecks in I/O-intensive applications, in particular the text database work done in connection with the New Oxford English Dictionary Project at the University of Waterloo. It performs a series of tests on a file of known size. By default, that size is 100 Mb (but that's not enough - see below). For each test, Bonnie reports the bytes processed per elapsed second, per CPU second, and the percent CPU usage (user and system). In each case, an attempt is made to keep optimizers from noticing it's all bogus. The idea is to make sure that these are real transfers to/from user space to the physical disk.
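  • A hedged sketch (not Bonnie itself) of the kind of measurement described above: write a file of known size in fixed-size blocks, time it, and report bytes per elapsed second. The file name and sizes are illustrative assumptions:

        #include <stdio.h>
        #include <string.h>
        #include <sys/time.h>

        #define BLOCK   8192
        #define NBLOCKS (100L * 1024 * 1024 / BLOCK)     /* about 100 MB total */

        int main(void)
        {
            static char buf[BLOCK];
            struct timeval t0, t1;
            FILE *fp = fopen("scratch.dat", "wb");       /* hypothetical scratch file */

            if (fp == NULL) { perror("fopen"); return 1; }
            memset(buf, 'x', sizeof buf);

            gettimeofday(&t0, NULL);
            for (long i = 0; i < NBLOCKS; i++)
                if (fwrite(buf, 1, BLOCK, fp) != BLOCK) { perror("fwrite"); return 1; }
            fclose(fp);              /* note: data may still sit in the OS cache */
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("sequential write: %.1f MB/s elapsed\n",
                   (double)NBLOCKS * BLOCK / (1024.0 * 1024.0) / secs);
            remove("scratch.dat");
            return 0;
        }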
  • Slide 36
  • If the cache size is >= X MB, then most if not all of the reads will be satisfied from the cache. However, if the cache is