BENCHMARKS Ramon Zatarain


  • Slide 1
  • BENCHMARKS Ramon Zatarain
  • Slide 2
  • INDEX: Benchmarks and Benchmarking; Relation of Benchmarks with Empirical Methods; Benchmark definition; Types of benchmarks; Benchmark suites; Measuring performance (CPU, comparing performance, etc.); Common system benchmarks; Examples of software benchmarks; Benchmark pitfalls; Recommendations; Benchmarking rules; Bibliography
  • Slide 3
  • Benchmarks and Benchmarking A benchmark was a reference point for determining one's current position or altitude in topographical surveys and tidal observations. A benchmark was a standard against which others could be measured.
  • Slide 4
  • Benchmarks and Benchmarking In the 1970s, the concept of a benchmark evolved beyond a technical term signifying a reference point. The word migrated into the lexicon of business, where it came to signify the measurement process by which to conduct comparisons.
  • Slide 5
  • Benchmarks and Benchmarking In the early 1980s, Xerox Corporation, a leader in benchmarking, defined it as the continuous process of measuring products, services, and practices against the toughest competitors.
  • Slide 6
  • Benchmarks and Benchmarking Benchmarks, in contrast to benchmarking, are measurements used to evaluate the performance of a function, operation, or business relative to others. In the electronics industry, for instance, a benchmark has long referred to an operating statistic that allows you to compare your own performance to that of another.
  • Slide 7
  • RELATION OF BENCHMARKS WITH EMPIRICAL METHODS In many areas of computer science, experiments are the primary means of demonstrating the potential and value of systems and techniques. Empirical methods for analysing and comparing systems and techniques are therefore of considerable interest to many CS researchers.
  • Slide 8
  • RELATION OF BENCHMARKS WITH EMPIRICAL METHODS The main evaluation criterion that has been adopted in some fields, such as satisfiability testing (SAT), is empirical performance on shared benchmark problems. In the seminar Future Directions in Software Engineering, many issues were addressed; some of them were:
  • Slide 9
  • RELATION OF BENCHMARKS WITH EMPIRICAL METHODS In the paper Research Methodology in Software Engineering, four methodologies were identified: the scientific method, the engineering method, the empirical method, and the analytical method. In the paper We Need To Measure The Quality Of Our Work, the author points out that we as a community have no generally accepted methods or benchmarks for measuring and comparing the quality and utility of our research results.
  • Slide 10
  • Examples: IEEE Computer Society Workshop on Empirical Evaluation of Computer Vision Algorithms; A benchmark for graphics recognition systems; An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl.
  • Slide 11
  • BENCHMARK DEFINITION Some definitions are: a test that measures the performance of a system or subsystem on a well-defined task or set of tasks; a method of comparing the performance of different computer architectures; or a method of comparing the performance of different software.
  • Slide 12
  • TYPES OF BENCHMARKS Real programs. They have input, output, and options that a user can select when running the program. Examples: compilers, text-processing software, etc. Kernels. Small, key pieces extracted from real programs; they are not run by end users on their own. Examples: Livermore Loops and Linpack.
  • Slide 13
  • TYPES OF BENCHMARKS Toy benchmarks. Typically between 10 and 100 lines of code; they produce a result the user already knows in advance. Examples: Sieve of Eratosthenes, Puzzle, and Quicksort. Synthetic benchmarks. They try to match an average execution profile. Examples: Whetstone and Dhrystone.
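  • A toy benchmark of this kind fits in a few lines. Below is a minimal C sketch of the Sieve of Eratosthenes; the bound of 8191 is arbitrary and only illustrative:

        #include <stdio.h>
        #include <string.h>

        #define LIMIT 8191                            /* arbitrary small bound */

        int main(void)
        {
            static char is_prime[LIMIT + 1];
            int count = 0;

            memset(is_prime, 1, sizeof is_prime);
            for (int i = 2; i <= LIMIT; i++) {
                if (is_prime[i]) {
                    count++;                          /* i is prime */
                    for (int k = i + i; k <= LIMIT; k += i)
                        is_prime[k] = 0;              /* mark multiples of i */
                }
            }
            printf("%d primes up to %d\n", count, LIMIT);
            return 0;
        }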
  • Slide 14
  • BENCHMARK SUITES A benchmark suite is a collection of benchmarks that tries to measure the performance of processors across a variety of applications. The advantage is that the weakness of any one benchmark is lessened by the presence of the others. Some benchmarks in a suite are kernels, but many are real programs.
  • Slide 15
  • BENCHMARK SUITES Example: SPEC92 benchmark suite (20 programs):

        Benchmark   Source    Lines of code   Description
        Espresso    C         13,500          Minimize Boolean functions
        Li          C         7,413           Lisp interpreter (9 queens problem)
        Eqntott     C         3,376           Translate Boolean equations
        Compress    C         1,503           Data compression
        Sc          C         8,116           Computation in a spreadsheet
        Gcc         C         83,589          GNU C compiler
        Spice2g6    Fortran   18,476          Circuit simulation package
        Doduc       Fortran   5,334           Simulation of a nuclear reactor
        Mdljdp2     Fortran   4,458           Chemical application
        Wave5       Fortran   7,628           Electromagnetic simulation
        Tomcatv     Fortran   195             Mesh generation program
        Ora         Fortran   535             Traces rays through optical systems
        Alvinn      C         272             Simulation in neural networks
        Ear         C         4,483           Inner ear model
  • Slide 16
  • MEASURING PERFORMANCE Wall-clock time (elapsed time): latency to complete a task, including disk accesses, input/output activity, memory accesses, and OS overhead. CPU time: does not include time spent waiting for I/O or running other programs. User CPU time: time spent in the program. System CPU time: time spent in the OS.
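  • A hedged sketch of these distinctions on a UNIX-like system: times() reports user and system CPU time separately, while gettimeofday() gives wall-clock (elapsed) time. The workload function here is only a stand-in:

        #include <stdio.h>
        #include <sys/time.h>
        #include <sys/times.h>
        #include <unistd.h>

        static void workload(void)                    /* stand-in for the benchmarked code */
        {
            volatile double x = 0.0;
            for (long i = 1; i <= 50000000L; i++)
                x += 1.0 / (double)i;
        }

        int main(void)
        {
            struct timeval w0, w1;
            struct tms c0, c1;
            long ticks = sysconf(_SC_CLK_TCK);        /* clock ticks per second */

            gettimeofday(&w0, NULL);
            times(&c0);
            workload();
            times(&c1);
            gettimeofday(&w1, NULL);

            double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_usec - w0.tv_usec) / 1e6;
            double user = (double)(c1.tms_utime - c0.tms_utime) / ticks;
            double sys  = (double)(c1.tms_stime - c0.tms_stime) / ticks;

            printf("wall-clock %.3f s, user CPU %.3f s, system CPU %.3f s\n",
                   wall, user, sys);
            return 0;
        }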
  • Slide 17
  • CPU Performance Measures MIPS (millions of instructions per second): how fast the machine can execute instructions. MFLOPS (millions of floating-point operations per second); GFLOPS (gigaflops). Other measures are Whets (Whetstone benchmark), VUP (VAX unit of performance), and SPECmarks. Note: MIPS is sometimes said to stand for "meaningless indicator of performance for salesmen".
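  • A minimal sketch of how such figures are derived; the instruction and operation counts below are made-up illustrative values, not measurements of any real machine:

        #include <stdio.h>

        int main(void)
        {
            double instructions = 250e6;   /* executed instructions (assumed) */
            double flops        = 40e6;    /* floating-point operations (assumed) */
            double seconds      = 2.0;     /* measured execution time (assumed) */

            printf("MIPS   = %.1f\n", instructions / (seconds * 1e6));   /* 125.0 */
            printf("MFLOPS = %.1f\n", flops / (seconds * 1e6));          /*  20.0 */
            return 0;
        }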
  • Slide 18
  • COMPARING PERFORMANCE Execution times of three programs on three machines:

                              Computer A   Computer B   Computer C
        Program P1 (secs)              1           10           20
        Program P2 (secs)          1,000          100           20
        Total time (secs)          1,001          110           40
  • Slide 19
  • CPU Performance Measures TOTAL EXECUTION TIME. An average of the execution times that tracks total execution time is the arithmetic mean:

        Arithmetic mean = (1/n) * sum(i = 1..n) Time_i

    where Time_i is the execution time of the ith program of a total of n in the workload. When performance is expressed as a rate, we use the harmonic mean:

        Harmonic mean = n / sum(i = 1..n) (1 / Rate_i)

    where Rate_i is a function of 1/Time_i, the execution time for the ith of n programs in the workload. The harmonic mean is used when performance is measured in MIPS or MFLOPS.
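  • A small C sketch of both means, using the P1/P2 execution times from the table on the previous slide (rates are taken here simply as 1/time, i.e. programs per second):

        #include <stdio.h>

        #define N 2   /* programs in the workload */

        static double arithmetic_mean(const double t[], int n)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += t[i];
            return sum / n;
        }

        static double harmonic_mean(const double rate[], int n)
        {
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                sum += 1.0 / rate[i];
            return n / sum;
        }

        int main(void)
        {
            double a[N] = { 1.0, 1000.0 };      /* computer A: P1, P2 times (secs) */
            double b[N] = { 10.0, 100.0 };      /* computer B */
            double c[N] = { 20.0, 20.0 };       /* computer C */
            double rate_a[N] = { 1.0 / a[0], 1.0 / a[1] };

            printf("Arithmetic mean of times: A %.2f  B %.2f  C %.2f\n",
                   arithmetic_mean(a, N), arithmetic_mean(b, N), arithmetic_mean(c, N));
            printf("Harmonic mean of rates, computer A: %.6f programs/sec\n",
                   harmonic_mean(rate_a, N));
            return 0;
        }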
  • Slide 20
  • CPU Performance Measures WEIGHTED EXECUTION TIME. A question arises: what is the proper mixture of programs for the workload? In the arithmetic mean we assume that programs P1 and P2 are run equally often in the workload. A weighted arithmetic mean is given by:

        Weighted mean = sum(i = 1..n) Weight_i * Time_i

    where Weight_i is the frequency of the ith program in the workload and Time_i is the execution time of program i.
  • Slide 21
  • CPU Performance Measures Weighted arithmetic mean execution times using three weightings:

                                   Comp A    Comp B    Comp C     W(1)     W(2)     W(3)
        Program P1 (secs)               1        10        20     0.50    0.909    0.999
        Program P2 (secs)           1,000       100        20     0.50    0.091    0.001
        Arithmetic mean: W(1)      500.50      55.0      20.0
        Arithmetic mean: W(2)       91.91     18.19      20.0
        Arithmetic mean: W(3)         2.0     10.09      20.0
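  • The weighted means in the table can be reproduced with a few lines of C; this sketch recomputes them from the raw times and the three weightings:

        #include <stdio.h>

        int main(void)
        {
            const char *machine[3] = { "A", "B", "C" };
            double time[3][2]   = { { 1.0, 1000.0 },      /* computer A: P1, P2 (secs) */
                                    { 10.0, 100.0 },      /* computer B */
                                    { 20.0, 20.0 } };     /* computer C */
            double weight[3][2] = { { 0.500, 0.500 },     /* weighting W(1) */
                                    { 0.909, 0.091 },     /* weighting W(2) */
                                    { 0.999, 0.001 } };   /* weighting W(3) */

            for (int w = 0; w < 3; w++) {
                printf("W(%d):", w + 1);
                for (int m = 0; m < 3; m++) {
                    double mean = weight[w][0] * time[m][0] + weight[w][1] * time[m][1];
                    printf("  %s = %.2f", machine[m], mean);
                }
                printf("\n");
            }
            return 0;
        }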
  • Slide 22
  • COMMON SYSTEM BENCHMARKS 007 (OODBMS). Designed to simulate a CAD/CAM environment.
    Tests:
    - Pointer traversals over cached data; disk-resident data; sparse traversals; and dense traversals
    - Updates: indexed and unindexed object fields; repeated updates; sparse updates; updates of cached data; and creation and deletion of objects
    - Queries: exact match lookup; ranges; collection scan; path-join; ad-hoc join; and single-level make
    Originator: University of Wisconsin
    Versions: Unknown
    Availability of Source: Free from ftp.cs.wisc.edu:/007
    Availability of Results: Free from ftp.cs.wisc.edu:/007
    Entry Last Updated: Thursday April 15 15:08:07 1993
  • Slide 23
  • AIM AIM Technology, Palo Alto. Two suites (III and V).
    Suite III: simulation of applications (task- or device-specific)
    - Task-specific routines (word processing, database management, accounting)
    - Device-specific routines (memory, disk, MFLOPs, IOs)
    - All measurements represent a percentage of VAX 11/780 performance (100%)
    In general, Suite III gives an overall indication of performance.
    Suite V: measures throughput in a multitasking workstation environment by testing:
    - Incremental system loading
    - Multiple aspects of system performance
    The graphically displayed results plot the workload level versus time. Several different models characterize various user environments (financial, publishing, software engineering). The published reports are copyrighted.
    An example of AIM benchmark results (in .pdf format)
  • Slide 24
  • Dhrystone Short synthetic benchmark program intended to be representative of system (integer) programming. Based on published statistics on the use of programming language features; see the original publication in CACM 27,10 (Oct. 1984), 1013-1030. Originally published in Ada, now mostly used in C. Version 2 (in C) was published in SIGPLAN Notices 23,8 (Aug. 1988), 49-62, together with measurement rules. Version 1 is no longer recommended since state-of-the-art compilers can eliminate too much "dead code" from the benchmark (however, quoted MIPS numbers are often based on Version 1).
    Problems: Due to its small size (100 HLL statements, 1-1.5 KB code), the memory system outside the cache is not tested; compilers can too easily optimize for Dhrystone; and string operations are somewhat over-represented.
    Recommendation: Use it for controlled experiments only; don't blindly trust single Dhrystone MIPS numbers quoted somewhere (as a rule, don't do this for any benchmark).
    Originator: Reinhold Weicker, Siemens Nixdorf ([email protected])
    Versions in C: 1.0, 1.1, 2.0, 2.1 (final version, minor corrections compared with 2.0)
    See also: R.P. Weicker, A Detailed Look... (see Publications, 4.3)
    Availability of source: [email protected], ftp.nosc.mil:pub/aburto
    Availability of results (no guarantee of correctness): same as above
  • Slide 25
  • Khornerstone Multipurpose benchmark used in various periodicals.
    Originator: Workstation Labs
    Versions: unknown
    Availability of Source: not free
    Availability of Results: UNIX Review
    LINPACK Kernel benchmark developed from the "LINPACK" package of linear algebra routines. Originally written and commonly used in FORTRAN; a C version also exists. Almost all of the benchmark's time is spent in a subroutine ("saxpy" in the single-precision version, "daxpy" in the double-precision version) doing the inner loop for frequent matrix operations:
        y(i) = y(i) + a * x(i)
    The standard version operates on 100x100 matrices; there are also versions for sizes 300x300 and 1000x1000, with different optimization rules.
    Problems: Code is representative only for this type of computation. LINPACK is easily vectorizable on most systems.
    Originator: Jack Dongarra, Computer Science Department, University of Tennessee, [email protected]
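  • The inner loop quoted above is small enough to show; a C rendering of the double-precision daxpy kernel might look like this:

        /* y(i) = y(i) + a * x(i): the loop in which LINPACK spends
           almost all of its time */
        void daxpy(int n, double a, const double *x, double *y)
        {
            for (int i = 0; i < n; i++)
                y[i] = y[i] + a * x[i];
        }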
  • Slide 26
  • MUSBUS Designed by Ken J. McDonell at Monash University in Australia, it is a very good benchmark of disk throughput and multi-user simulation. It compiles, creates the directories and the workload for the simulated users, and executes the simulation three times, measuring CPU and elapsed time. The workload consists of 11 commands (cc, rm, ed, ls, cp, spell, cat, mkdir, export, chmod, and an nroff-like spooler) and 5 programs (syscall, randmem, hanoi, pipe, and fstime). This is a very complete test and a significant measurement of CPU speed, C compiler and UNIX quality, file system performance and multi-user capabilities, disk throughput, and memory management implementation.
  • Slide 27
  • Nhfsstone A benchmark intended to measure the performance of file servers that follow the NFS protocol. The work in this area continued within the LADDIS group and finally within SPEC. The SPEC benchmark 097.LADDIS is intended to replace Nhfsstone. It is superior to Nhfsstone in several aspects (multi-client capability, less client sensitivity).
  • Slide 28
  • SPEC SPEC stands for Standard Performance Evaluation Corporation, a non-profit organization whose goal is to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high performance computers" (from SPEC's bylaws). The SPEC benchmarks and more information can be obtained from:
    SPEC [Standard Performance Evaluation Corporation]
    c/o NCGA [National Computer Graphics Association]
    2722 Merrilee Drive, Suite 200
    Fairfax, VA 22031, USA
    Phone: +1-703-698-9600 Ext. 325
    FAX: +1-703-560-2752
    E-Mail: [email protected]
    The current SPEC benchmark suites are: CINT92 (CPU-intensive integer benchmarks), CFP92 (CPU-intensive floating-point benchmarks), SDM (UNIX software development workloads), and SFS (system-level file server (NFS) workload).
  • Slide 29
  • SSBA The SSBA is the result of the studies of the AFUU (French Association of UNIX Users) Benchmark Working Group. This group, consisting of some 30 active members of varied origins (universities, public and private research, manufacturers, end users), has assigned itself the task of assessing the performance of data processing systems, collecting a maximum number of tests available throughout the world, dissecting the codes and results, discussing the utility, fixing versions, and supplying them with various comments and procedures. A sample output of the SSBA suite of UNIX benchmark tests
  • Slide 30
  • Sieve of Eratosthenes An integer program that generates prime numbers using a method known as the Sieve of Eratosthenes.
    TPC TPC-A is a standardization of the Debit/Credit benchmark, which was first published in DATAMATION in 1985. It is based on a single, simple, update-intensive transaction which performs three updates and one insert across four tables. Transactions originate from terminals, with a requirement of 100 bytes in and 200 bytes out. There is a fixed scaling between tps rate, terminals, and database size. TPC-A requires an external RTE (remote terminal emulator) to drive the SUT (system under test). TPC-C, the later order-entry benchmark, performs five kinds of transactions: entering a new order, delivering orders, posting customer payments, retrieving a customer's most recent order, and monitoring the inventory level of recently ordered items.
  • Slide 31
  • Whetstone The first major synthetic benchmark program, intended to be representative of numerical (floating-point intensive) programming. Based on statistics gathered at the National Physical Laboratory in England, using an Algol 60 compiler which translated Algol into instructions for the imaginary Whetstone machine. The compilation system was named after the small town outside the City of Leicester, England, where it was designed (Whetstone).
    Problems: Due to the small size of its modules, the memory system outside the cache is not tested; compilers can too easily optimize for Whetstone; mathematical library functions are over-represented.
    Originator: Brian Wichmann, NPL
  • Slide 32
  • Whetstone One of the first and most popular benchmarks, Whetstone was originally published in 1976 by Curnow and Wichmann in Algol and subsequently translated into FORTRAN. This synthetic mix of elementary Whetstone instructions is modeled with statistics from about 1000 scientific and engineering applications. Whetstone is rather small and, due to its straightforward coding, may be prone to particular (and unintentional) treatment by intelligent compilers. It is very sensitive to the processing of transcendental and trigonometric functions, and strongly dependent on a fast or additional mathematics coprocessor. Whetstone is a good predictor for engineering and scientific applications.
  • Slide 33
  • SYSmark SYSmark93 provides benchmarks that can be used to measure performance of IBM PC-compatible hardware for the tasks users perform on a regular basis. SYSmark93 benchmarks represent the workloads of popular programs in such applications as word processing, spreadsheets, database, desktop graphics, and software development.
  • Slide 34
  • Stanford A collection of C routines developed in 1988 at Stanford University (J. Hennessy, P. Nye). Its two modules, Stanford Integer and Stanford Floating Point, provide a baseline for comparisons between Reduced Instruction Set (RISC) and Complex Instruction Set (CISC) processor architectures.
    Stanford Integer: eight applications (integer matrix multiplication, sorting algorithms [quick, bubble, tree], permutation, hanoi, 8 queens puzzle)
    Stanford Floating Point: two applications (Fast Fourier Transform [FFT] and matrix multiplication)
    The characteristics of the programs vary, but most of them have array accesses. There seems to be no official publication (only a printing in a performance report), and there is no defined weighting of the results (Sun and MIPS compute the geometric mean).
  • Slide 35
  • Bonnie This is a file system benchmark that attempts to study bottlenecks. Specifically, these are the types of filesystem activity that have been observed to be bottlenecks in I/O-intensive applications, in particular the text database work done in connection with the New Oxford English Dictionary Project at the University of Waterloo. It performs a series of tests on a file of known size. By default, that size is 100 Mb (but that's not enough - see below). For each test, Bonnie reports the bytes processed per elapsed second, per CPU second, and the percent CPU usage (user and system). In each case, an attempt is made to keep optimizers from noticing it's all bogus. The idea is to make sure that these are real transfers to/from user space to the physical disk.
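  • A hedged sketch (not Bonnie itself) of the kind of measurement described above: write a file of known size in fixed-size blocks, time it, and report bytes per elapsed second. The file name and sizes are illustrative assumptions:

        #include <stdio.h>
        #include <string.h>
        #include <sys/time.h>

        #define BLOCK   8192
        #define NBLOCKS (100L * 1024 * 1024 / BLOCK)     /* about 100 MB total */

        int main(void)
        {
            static char buf[BLOCK];
            struct timeval t0, t1;
            FILE *fp = fopen("scratch.dat", "wb");       /* hypothetical scratch file */

            if (fp == NULL) { perror("fopen"); return 1; }
            memset(buf, 'x', sizeof buf);

            gettimeofday(&t0, NULL);
            for (long i = 0; i < NBLOCKS; i++)
                if (fwrite(buf, 1, BLOCK, fp) != BLOCK) { perror("fwrite"); return 1; }
            fclose(fp);              /* note: data may still sit in the OS cache */
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("sequential write: %.1f MB/s elapsed\n",
                   (double)NBLOCKS * BLOCK / (1024.0 * 1024.0) / secs);
            remove("scratch.dat");
            return 0;
        }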
  • Slide 36
  • If the cache size is >= X MB, then most if not all of the reads will be satisfied from the cache. However, if the cache is