
SINGLE-THREADED VS. MULTITHREADED: WHERE SHOULD WE FOCUS?

TO CONTINUE TO OFFER IMPROVEMENTS IN APPLICATION PERFORMANCE, SHOULD COMPUTER ARCHITECTURE RESEARCHERS AND CHIP MANUFACTURERS FOCUS ON IMPROVING SINGLE-THREADED OR MULTITHREADED PERFORMANCE? THIS PANEL, FROM THE 2007 WORKSHOP ON COMPUTER ARCHITECTURE RESEARCH DIRECTIONS, DISCUSSES THE RELEVANT ISSUES.

Moderator's introduction: Joel Emer

Today, with the increasing popularity of multicore processors, one approach to optimizing the processor's performance is to reduce the execution times of individual applications running on each core by designing and implementing more powerful cores. Another approach, which is the polar opposite of the first, optimizes the processor's performance by running a larger number of applications (or threads of a single application) on a correspondingly larger number of cores, albeit simpler ones. The difference between these two approaches is that the former focuses on reducing the latency of individual applications or threads (it optimizes the processor's single-threaded performance), whereas the latter focuses on reducing the latency of the application's threads taken as a group (it optimizes the processor's multithreaded performance).

The obvious advantage of the single-threaded approach is that it minimizes the execution times of individual applications (especially preexisting or legacy applications), but potentially at the cost of longer design and verification times and lower power efficiency. By contrast, although the multithreaded approach may be easier to design and have higher power efficiency, its utility is limited to specific, highly parallel applications; it is difficult to program for other, less-parallel applications.

Of course, the multiprocessor versus uniprocessor controversy is not a new issue. My earliest recollection of it is from back in the 1970s, when Intel introduced the 4004, the first microprocessor. I remember almost immediately seeing proposals that said that if we could just hook together a hundred or a thousand of these, we would have the best computer in the world, and it would solve all of our problems. I suspect that no one remembers the result of that research, as the long series of increasingly more powerful microprocessors is a testament to our success at (and the value of) improving uniprocessor performance. Yet, with each successive generation (the 4040, the 8080, all the way up to the present), we've seen exactly the same sort of proposal for ganging together lots of the latest-generation uniprocessors.

Joel Emer
Intel

Mark D. Hill
University of Wisconsin–Madison

Yale N. Patt
University of Texas at Austin

Joshua J. Yi
Freescale Semiconductor

Derek Chiou
University of Texas at Austin

Resit Sendag
University of Rhode Island


So the question for these panelists is, why is today's generation different from any of the past generations? Clearly, it is not just that we cannot achieve gains in instructions per cycle (IPC) in direct proportion to the number of transistors used, because that's never been true. If it had been true, we would have IPCs several orders of magnitude larger than those we have today. Our recent rule of thumb has been that processor performance improves as the square root of the number of transistors, and cache-miss rates likewise improve as the square root of the cache size. Yet, despite these sublinear architectural improvements, until recently, uniprocessors have been the preferred trajectory.
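To make the sublinearity concrete, here is a back-of-the-envelope sketch of the two square-root rules of thumb Emer cites (the ratios below are illustrative assumptions, not measured data):

```python
# Back-of-the-envelope illustration of the square-root rules of thumb:
# performance ~ sqrt(transistor count), miss rate ~ 1/sqrt(cache size).
import math

def relative_performance(transistor_ratio: float) -> float:
    """Estimated speedup when the transistor budget grows by this ratio."""
    return math.sqrt(transistor_ratio)

def relative_miss_rate(cache_size_ratio: float) -> float:
    """Estimated miss-rate ratio when the cache grows by this factor."""
    return 1.0 / math.sqrt(cache_size_ratio)

for ratio in (2, 4, 10, 100):
    print(f"{ratio:>4}x transistors -> {relative_performance(ratio):.2f}x performance; "
          f"{ratio}x cache -> {relative_miss_rate(ratio):.2f}x miss rate")
# 100x transistors yields only ~10x performance: the sublinear return Emer describes.
```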

Why is the uniprocessor trajectory insufficient for today? Are there no ideas that will bring even that sublinear architectural performance gain? Why aren't the order-of-magnitude gains being promised by single-instruction, multiple-data (SIMD), vector, and streaming processors of interest? Is the complexity of current architectures a factor in the inability to push across the next architectural performance step? Or is there a feeling that, irrespective of architecture, we've crossed a complexity threshold beyond which we can't build better processors in a timely fashion?

Looking at multiprocessors, is using full (but simpler) processors as the building blocks the right granularity? Are multiprocessors really such a panacea of simplicity? How much complexity is going to be introduced in the interprocessor interconnect, in the shared cache hierarchy, and in processor support for mechanisms like transactional memory? Much of the challenge of past generations has been coping with the increasing disparity between processor speed and memory speed, or the limits of die-to-memory bandwidth. Does having a multiprocessor with its multiple contexts make this problem worse instead of better? And, even if multiprocessors really are a simpler alternative, what is the application domain over which they will provide a benefit? Will enough people be able to program them?

As I hope is clear, both approaches have advantages and disadvantages. It is, however, unclear which approach will be more heavily used in the future, and which should be the major focus of our research. The goal of this panel was to discuss the merits of each approach and the trade-offs between the two.

The case for single-threaded optimization: Yale N. Patt

The purpose of a chip is to optimally execute the user's desired single application.

In other words, the reason for designing an expensive chip, rather than a network of simple chips, is to speed up the execution of individual programs. To the user, optimally executing a desired application means minimizing its execution time. Amdahl's law says the maximum speedup in the execution time is limited by the time it takes to execute the slowest component of the program. Therefore, when we parallelize an application to run multiple threads across multiple cores, the maximum speedup is limited by the time that it takes to execute the serial part of the program, or the one that needs to run on a powerful single core.
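A minimal worked example of Amdahl's law makes the limit concrete (the parallel fraction and core counts below are illustrative assumptions, not figures from the panel):

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
# p = fraction of execution time that parallelizes, n = number of cores.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the program parallelized, 1,000 cores give at most ~20x,
# because the 5% serial part dominates -- hence Patt's case for one fast core.
for n in (4, 16, 64, 1000):
    print(f"n={n:>4}: speedup = {amdahl_speedup(0.95, n):.1f}x")
```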

For example, suppose a computer architect designs a processor for a specific operating system that needs to periodically execute an old VAX instruction. In the extreme, assume that this instruction shows up only once every two weeks. In that case, because that instruction is not frequently executed, the computer architect does not allocate much hardware to execute that instruction. In the worst case, if it takes the processor three weeks to execute the instruction, then the program will never finish executing, because the processor did not run the slowest part of the program fast enough. Although this example is clearly exaggerated, it motivates the need to continue to focus on improving the single-threaded performance of processors.

One potential objection to continuing to focus heavily on single-threaded microarchitectural research is that there is insufficient parallelism at the instruction level to significantly improve processor performance, whereas there may be significantly more parallelism at the thread level that can be exploited instead.


About this article

Derek Chiou, Resit Sendag, and Josh Yi conceived and organized the 2007 CARD Workshop and transcribed, organized, and edited this article based on the panel discussion. Video and audio of this and the other panels can be found at http://www.ele.uri.edu/CARD/.


However, difficult problems still need to be solved, and the computer architecture community will not solve them if the best and brightest are steered away from those problems. Creativity and ingenuity can solve those problems. For example, in the RISC vs. CISC debate, CISC did not win merely because of legacy code. Rather, CISC won the debate because computer architects had the ingenuity to develop solutions such as out-of-order execution combined with in-order retirement, wide issue, better branch prediction, and so on. Figure 1, for instance, shows the headroom remaining for a 16-wide issue processor that uses the best branch predictor and prefetcher available at the time the measurements were made.1

Take, for example, the creativity of computer architects with respect to branch prediction. Conventional wisdom in 1990 pretty well accepted as fact that instruction-level parallelism on integer codes would never be greater than 1.85 IPC. However, even simple two-level branch predictors could improve the processor's performance beyond 1.85 IPC. And, today, computer architects such as Andre Seznec and Daniel Jimenez have proposed more sophisticated branch predictors that are better than the two-level branch predictor. There is still quite a bit more unrealized performance to be gotten from better branch prediction and better memory hierarchies (see Figure 1). The key point is that the computer architecture community should not avoid the difficult problems, but rather should harness its members' creativity and ingenuity to propose solutions and solve those problems.
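For readers who have not seen the mechanism Patt credits, the following is a minimal sketch of a two-level adaptive branch predictor (the history length, table size, and test branch pattern are illustrative assumptions; real designs vary widely):

```python
# Minimal sketch of a two-level adaptive branch predictor (illustrative only).
# Level 1: a global branch history register (last HISTORY_BITS outcomes).
# Level 2: a pattern history table of 2-bit saturating counters, indexed by that history.
HISTORY_BITS = 8

class TwoLevelPredictor:
    def __init__(self):
        self.history = 0                          # global history register
        self.pht = [2] * (1 << HISTORY_BITS)      # counters start weakly taken

    def predict(self) -> bool:
        return self.pht[self.history] >= 2        # counter >= 2 means "predict taken"

    def update(self, taken: bool) -> None:
        ctr = self.pht[self.history]
        self.pht[self.history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A loop branch (taken 7 times, then not taken) quickly becomes predictable,
# because each 8-bit history uniquely determines the next outcome:
p = TwoLevelPredictor()
outcomes = ([True] * 7 + [False]) * 100
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(f"accuracy: {correct / len(outcomes):.0%}")  # approaches 100% after warm-up
```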

Another potential objection to the single-threaded focus is power consumption. Power consumption in a 10-billion-transistor processor is a significant problem. However, a processor with 10 billion transistors could have several large functional units that remain powered off when not in use, but that are powered up via compiled code when necessary to carry out a needed piece of work specified by the algorithm. I use my Refrigerator analogy to explain this type of microarchitecture. William "Refrigerator" Perry was a huge defensive tackle for the 1985 Chicago Bears. While the Bears were on offense, Perry sat on the bench until they were on the one-yard line. Then, they would power him up to carry the ball for a touchdown. To me, the key question is, how much functionality can we put into a processor that remains powered off until it is needed?

I think the processor of the future will be what I call Niagara X, Pentium Y.

    Figure 1. Potential of improving caches and branch prediction.


The Niagara X part of the processor consists of many very lightweight cores for processing the embarrassingly parallel part of an algorithm. The Pentium Y part consists of very few heavyweight cores, or perhaps only one, with serious hybrid branch prediction, out-of-order execution, and so on, to handle the serial part of an algorithm to mitigate the effects of Amdahl's law. To ensure that the processor's performance does not suffer because of intrachip communication, a high-performance interconnect will need to connect the Pentium Y core to the Niagara X cores. Additionally, computer architects need to determine how transistors can minimize the performance implications of off-chip communication.

The case for multithreaded optimization: Mark D. Hill

For decades, increasing transistor budgets have allowed computer architects to improve processor performance by exploiting bit-level parallelism, instruction-level parallelism, and memory hierarchies. However, several walls will steer computer architects away from continuing to design increasingly more powerful single-threaded processors and toward designing less complex, multithreaded processors.

The first wall is power. Some current-generation processors require large, even absurd, amounts of power. For these processors, and to a lesser degree for other lower-power processors, high power consumption makes it more difficult and expensive to cool the processor or forces the processor to operate at a higher temperature, which can lower the processor's lifetime reliability. In both cases, higher power consumption either increases the operational cost of owning the processor, by increasing the costs of electricity and server room cooling, or decreases the battery life. The current situation is very similar to the one mainframe developers faced about 10 to 15 years ago with emitter-coupled logic. Although mainframe developers were able to switch to an alternative technology that was initially slower but eventually proved better, namely CMOS, the computer architecture community does not currently have a viable alternative technology that we can switch to. Thus, we must adopt a different approach to designing our processors.

The second wall is complexity. The basic issue with the complexity wall is that it is becoming too difficult, and taking too long, to design and verify next-generation processors, and to manufacture them at sufficiently high yield rates. Although some computer architects think that this is a significant wall, I do not, since I believe that creative people like Joel Emer and Yale Patt can create abstractions to manage the complexity.

The third, and final, wall is memory, which I think is a very significant problem. Currently, accessing main memory requires possibly hundreds of core cycles, which can be hundreds of times slower than a floating-point multiply. Consequently, accessing main memory wastes hundreds, or even thousands, of instruction execution opportunities. How much instruction-level or thread-level parallelism is available for the processor to exploit while waiting for a memory access to complete? And, even if there is a large amount of parallelism, the power and complexity walls limit how the computer architect can design a processor to exploit that parallelism.
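A rough calculation, with illustrative clock, latency, and issue-width numbers that are assumptions rather than panel data, shows the scale of the waste Hill describes and how much concurrency it takes to hide a miss:

```python
# Rough scale of the memory wall (illustrative assumptions throughout).
CLOCK_GHZ = 3.0
MEMORY_LATENCY_NS = 100          # assumed DRAM access latency
ISSUE_WIDTH = 4                  # assumed instructions issued per cycle

cycles_per_miss = MEMORY_LATENCY_NS * CLOCK_GHZ          # 300 cycles
lost_slots = cycles_per_miss * ISSUE_WIDTH               # 1,200 issue slots per miss
print(f"{cycles_per_miss:.0f} cycles stalled, {lost_slots:.0f} issue slots lost per miss")

# Little's-law view: to keep the core busy, there must be enough independent
# work (threads or memory-level parallelism) in flight to cover the latency.
run_cycles_between_misses = 30   # assumed compute between misses, per thread
threads_to_hide = 1 + cycles_per_miss / run_cycles_between_misses
print(f"~{threads_to_hide:.0f} threads needed to hide the latency")
```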

Given these three walls, the computer architecture community needs to continue to design processors that exploit parallelism if we want to continue to improve processor performance. However, I do not believe that we can continue to exploit instruction-level parallelism with identical operations, that is, SIMD vectors. Rather, we need to exploit a higher-level parallelism in which we can do slightly different things along each parallel path. Note that this higher-level parallelism could be like the thread-level parallelism that has been spectacularly successful in narrow domains such as database management systems, Web servers, and scientific computing, but less successful for other application spaces. More specifically, instead of exploiting bit-level and instruction-level parallelism with wider, more powerful single-threaded processors, we need to use multicore processors to exploit higher-level parallelism.

My approach to this problem is to advocate the development of new hardware and software models (Figure 2). The architecture community needs to raise the level of abstraction at the software layer. Although most people naturally do not think in parallel, there are tremendous success stories, like the SQL language for relational databases and Google's MapReduce. In the former case, most people can write declarative queries and let the underlying system optimize and create great parallelism because of the strong properties of the relational algebra. In the latter case, a limited set of programs allows the user to program even clusters relatively easily. However, these two examples are point solutions only; we need to develop these types of solutions more broadly.
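To illustrate the kind of abstraction Hill means, here is a word count written in the map/reduce style (a single-process sketch; Google's actual MapReduce is a distributed system, but the point is that the programmer writes two sequential functions and the runtime is free to parallelize them):

```python
# Word count in the MapReduce style (single-process sketch; a real runtime
# distributes the map tasks across a cluster, but the user code is the same).
from collections import defaultdict
from itertools import chain

def mapper(line: str):
    for word in line.split():
        yield (word, 1)                      # emit (key, value) pairs

def reducer(word: str, counts: list) -> int:
    return sum(counts)                       # combine all values for one key

def map_reduce(lines):
    # Shuffle: group intermediate pairs by key. A real runtime parallelizes
    # the map calls and partitions the keys across reduce workers.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(l) for l in lines):
        groups[key].append(value)
    return {k: reducer(k, v) for k, v in groups.items()}

print(map_reduce(["the cat sat", "the cat ran"]))   # {'the': 2, 'cat': 2, ...}
```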

The key at the hardware level is to support heterogeneous operations, and not just strict SIMD. Additionally, the computer architecture community needs to add mechanisms that can improve the performance of software models; examples include transactional memory and speculative parallelism (for example, thread-level speculation). Adding these hardware features helps support the software model, which allows the programmer to continue to think sequentially.
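As a flavor of the transactional abstraction Hill mentions, here is a software sketch built around a hypothetical atomic() helper (an illustrative name, not a real API); hardware transactional-memory proposals execute such regions speculatively with conflict detection, whereas this stand-in merely models the programmer's view with one global lock:

```python
# Flavor of the transactional abstraction (software sketch, hypothetical API).
# Hardware TM executes the block speculatively and rolls back on conflict;
# here a single global lock only models the programmer-visible contract.
import threading
from contextlib import contextmanager

_global_tm_lock = threading.Lock()   # stand-in for hardware conflict detection

@contextmanager
def atomic():
    """The programmer's contract: the block appears to execute all at once."""
    with _global_tm_lock:
        yield

account = {"a": 100, "b": 0}

def transfer(amount: int) -> None:
    with atomic():                   # no per-object locks, no ordering rules
        account["a"] -= amount
        account["b"] += amount

threads = [threading.Thread(target=transfer, args=(10,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(account)                       # {'a': 0, 'b': 100}
```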

Multicore architecture

Hill: [removing faux white beard he has been wearing in imitation of Patt; see Figure 3] Let me take this beard off. It's making me think way too sequentially. So, Yale, is it correct that you think that we should have chips with lots of Niagaras and one beefier core? Are you saying that I've won the debate?

    Figure 2. Toward a future hardware/software stack.

Figure 3. Panelists Yale Patt and Mark Hill and moderator Joel Emer (left to right). Hill removed his beard during the discussion, claiming it caused him to think too sequentially. Patt and Emer declined to remove theirs.


    Patt: No.

Hill: You're trying to say that's the uniprocessor argument?

Patt: I'm saying that you need one or more fast uniprocessors on the chip. Without that, Amdahl's law gets in the way. That is, yes, your Niagara thing will run the part of the problem that is embarrassingly parallel. But what about the thing that really needs a serious branch predictor or maybe a trace cache? I have no idea what people are going to come up with in the future. What I don't want is for people to get in the don't-worry-about-it mind-set.

Hill: Okay, I agree that we need to have multiple cores and that one of those cores should be much more powerful, or alternatively that several of those cores should be much more powerful.

    Emer: So now Yale has won the debate.

Hill: No, that's multiple cores.

Patt: I'm not suggesting that having multiple cores is not important. What I am suggesting is that we shouldn't forget the part of the problem that requires the heavy-duty single core. You know, you hear a lot of people say, "The time for that has passed." There are so many opportunities to further improve single-core performance. Take the stuff they did at Universitat Politecnica de Catalunya (UPC) with virtual allocate, where, at rename time, they allocate a reorder buffer register but don't actually use it until it is really needed. They save the power from the time they rename it until the time they actually use it: doing things virtually until you really need it, turning off hardware until you really need it... There's a whole lot out there that in fact students in this room will work on, maybe, if they're not told, "Forget it; the real action is in multiple cores." The fact of the matter is, [smiling] the real action is in Web design.

Hill: So, in your view, what is the rate of performance improvement that we can expect to get out of a single core?

Patt: The rate is zero, if we think negatively. Sammy Davis Jr. wrote an autobiography entitled Yes I Can. That's what I would like to encourage people to think: "Yes I can," as opposed to just giving up and saying, "You know, when it's 10 billion transistors, I guess it won't be 20 Pentium 4s; it will be 200 Pentium 4s."

Hill: I guess I should remind you that I absolutely think that we can push uniprocessors forward. We can get improvements, but we're not going to see the improvements that we've seen in the past. And there are markets out there that have grown used to this opium of doubling in performance. That is what we are going to lose going forward.

Emer: How much architectural improvement have we seen?

Patt: We've seen quite a bit. In fact, there was a 2000 Asia and South Pacific Design Automation Conference presentation by a guy at Compaq/DEC [Digital Equipment Corporation] named Bill Herrick, who said that between the [Alpha] EV-4 and EV-8 there was a factor of 55 improvement in performance.2 That represented a timeframe of about 10 years. The contribution from technology, essentially in cycle time, was a factor of 7. Thus, the contribution from microarchitecture, roughly a factor of 8, was more than the contribution from technology. EV-4 was the first 21064, and EV-8, had it been delivered, would have been the 21464.

Take branch predictors. The Alpha 21064 used a last-taken branch predictor. The Alpha 21264 used a hybrid two-level predictor. Or, in-order versus out-of-order execution: the first Alpha was in-order execution. The third one was out-of-order execution. Issue width: you know, the first Alpha was two-wide issue. The second one was four-wide issue. I am not saying that we should tell these students, "No multiple cores." I agree, we need to teach thinking and programming in parallel. But we also need to expose them to algorithms and to all of the other levels down to the circuit level.

Audience member: [interjecting] Because it's a bad idea. Abstraction is a good thing.


Patt: Abstraction's a good thing? Abstraction is a good thing if you don't care about the performance of the underlying entities. You know, many schools teach freshman programming in Java. So what's a data structure? Who cares? My hero is Donald Knuth, who teaches data structures by showing you how data is stored in memory. Knuth says that unless the programmer understands how the data structure is actually represented in memory, and how the algorithm actually processes that data structure, the programmer will write inefficient algorithms. I agree. Ask a current graduate in computer science, "Does it matter whether or not all of the data to be sorted can be in memory at the same time?" How many will say "Yes" and pick their sorting algorithm accordingly?
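The distinction behind Patt's question, sketched in code (the chunk size and temporary-file handling are illustrative): when the data does not fit in memory, the right answer is an external merge sort, which sorts bounded chunks in RAM and then streams a k-way merge, rather than an in-memory sort of the whole dataset:

```python
# If the data exceeds memory: sort fixed-size chunks in RAM, spill each sorted
# run to disk, then stream a k-way merge of the runs (external merge sort).
import heapq, os, tempfile

def _spill(sorted_chunk):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_chunk)
    return path

def external_sort(values, chunk_size=1000):
    """Sort an arbitrarily large iterable of ints using bounded memory (sketch)."""
    run_files, chunk = [], []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:          # memory budget reached: spill a run
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(_spill(sorted(chunk)))
    runs = [map(int, open(f)) for f in run_files]
    yield from heapq.merge(*runs)             # streaming k-way merge of sorted runs

print(list(external_sort([5, 3, 8, 1, 9, 2, 7], chunk_size=3)))  # [1, 2, 3, 5, 7, 8, 9]
```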

Programming parallel machines

Audience member: I'd like to pick up on that point. There seem to be a lot of people now who argue that we won't realize the performance benefits of multicore unless programmers explicitly manage data locality. And in some cases that's not so hard, if it naturally fits a stream or some other paradigm. But is this a realistic expectation? And what do we do?

Patt: I think there are two types of programmers: yo-yos and serious programmers. And if you really want performance, you don't want yo-yos. So, yes, for every programmer who knows what he or she is doing, there are thousands who don't. Object-oriented programming isolates the programmer from what's going on, but don't expect performance.

Hill: But I don't think it's as simple as that. I think that there are a lot of cases where people are solving very deep intellectual challenges in their problem domains. Insofar as we can automatically help them create the locality, I think it's a good thing and we should do it.

    Patt: Absolutely.

Hill: I am willing to give up some performance. I just don't see people managing the details of locality. I think the tougher problem is parallelizing the work and coming up with the abstractions somewhere in the tool chain so that the work will be parallelized. I think that locality is also important, but it is not the hardest thing to do.

Patt: Yes, the developer needs tools to break down the work at large levels, but if you want performance, you're going to need knowledge of what's underneath.

Providing simpler programming models

Hill: I still think that even sequential consistency is too complicated for those programmers. You're already reasoning with arbitrary threads and arbitrary nondeterminism and locks, so who wants to also reason about memory reordering and fences?
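The classic store-buffering example shows what Hill is sparing programmers from. This sketch (the encoding of the two threads is illustrative) enumerates every interleaving allowed under sequential consistency to confirm that the outcome weaker hardware models permit can never occur under SC:

```python
# Why even sequential consistency (SC) is the *easy* case: enumerate every
# interleaving of  T1: x=1; r1=y   and   T2: y=1; r2=x  that preserves each
# thread's program order. Under SC, r1 == r2 == 0 never appears; weaker
# models (with store buffers) allow it, and reasoning about that requires
# the memory fences Hill says nobody wants to think about.
from itertools import permutations

def sc_outcomes():
    results = set()
    for order in permutations(range(4)):  # ops: 0: x=1  1: r1=y  2: y=1  3: r2=x
        # Keep only interleavings that preserve each thread's program order.
        if order.index(0) < order.index(1) and order.index(2) < order.index(3):
            x = y = 0
            r1 = r2 = None
            for op in order:
                if op == 0:
                    x = 1
                elif op == 1:
                    r1 = y
                elif op == 2:
                    y = 1
                else:
                    r2 = x
            results.add((r1, r2))
    return results

print(sorted(sc_outcomes()))  # [(0, 1), (1, 0), (1, 1)] -- never (0, 0) under SC
```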

Patt: You know, I'm hopeful that Mark is right and that someday we will think parallel.

Hill: The key is enabling people to think at a higher level of abstraction. Take SQL: there is a lot of parallelism underneath, but people don't think parallel when they write SQL.

Patt: So, somehow, magically, these intermediate layers will do it for you?

Emer: That's the research that Mark advocates pursuing more aggressively, in lieu of the solutions for higher single-stream performance that you're advocating.

Hill: In addition to! We've got to know a two-handed economist.

Patt: And certainly there are people working on trying to take the sequential abstraction of the program and having the compiler automatically parallelize that code.

Hill: I don't think that's good enough. I mean, just taking C or Java and parallelizing it is not good enough. We need to go higher than that.

Patt: I agree with that. In fact, that's Wen-mei [Hwu's] point, that you should have the library available and the heavy-duty logic available, which I agree with also.


Checkpointed processors

Audience member: Any thoughts on checkpointed processors and how they change the trade-off of processor complexity and performance, and perhaps the functionality for multithreading?

Hill: I think checkpointed processors are a very interesting and viable technique that is going to make uniprocessors better, but there are also arguments that they can help multiprocessing. For example, you have a paper accepted to ISCA on bulk sequential consistency (SC) where the checkpointed processors help multiprocessing. I don't see checkpointed processors as fundamentally tipping this debate, but they are a good idea.

Emer: So, you're saying that there's research that applies to both uniprocessor and multiprocessor domains, potentially.

Patt: That's right.

Breaking down the barriers around architects

Audience member: In general, I agree with Mark [Hill], and I'd like to amplify his comments that good abstraction to parallelism using a parallel programming model is probably the most important point. How can architects enable that? In my view, it's not going to happen with just architects working alone. It's not going to happen if the other people are working alone. We really need to work together, and it's very hard to do that. So, the question is, what can we really do to enable that synergy?

Hill: I completely agree. Doing it alone, we can make some progress. But to have the real progress, we have to work all the way up to theoreticians. Well, there are two ways we're going to do this. One is if we can convince our colleagues to do the research. The other is if performance improvement falls flat on its face for a number of years; then they'll start paying attention. I think we want to try to use the former route, because the latter route is not going to be pretty.

Emer: We've been in a situation for a number of years where we've been able to work behind an architectural boundary to make processors go faster, and software people can be largely oblivious to what we're doing.

Hill: A very concrete example of this is that Microsoft had no interest in architecture research until recently. And suddenly it occurred to them that being ignorant of what's happening on the other side of the interface is no longer a viable strategy.

Emer: They probably did get a lot of performance by being oblivious. That's all the legacy code that we do have.

Audience member: Actually, a good analogy is the memory consistency model issue. The weaker models were exposed to the software community for the longest time, and they chose to ignore them. But in the last five or six years, there's been this huge effort from the software community to get things fixed there. So, I do see a hope that the other people will band together with architects, but I think that something needs to be done proactively to enable that synergy to actually happen.

Patt: Yeah, so what [the audience member] has opened up is whether this should be a sociology debate rather than a hardware one, which I think is right. I think we are uniquely positioned where we are as architects because we're the center. There are the software people up here, and the circuit designers down here, and we, if we're doing our job, see both sides. We are uniquely positioned to engage those people. Historically, you said that software people are oblivious, and yet they still get performance because we did our job so well. I think we can continue to do our job so well. I don't think we're going to run out of performance if we address Amdahl's bottleneck. For a certain amount of time I think [the audience member] is right, that we need to be engaging people and problems on a number of levels. In fact, I would say, the number one problem today has nothing to do with performance, and that is security.

Hill: I just want to add one more comment. I think what's really tough is to parallelize code. You may notice that at Illinois, too, a lot of software people did some parallel work in the 1980s and early 1990s. But the efforts petered out, and they left! And now there are very few parallel software experts around.

What if there were no power wall?

Audience member: Hypothetically, if we didn't have a power wall, would we be able to continue scaling uniprocessor performance and thus avoid having to go after parallelism?

Hill: There's still the memory wall. The power wall makes it worse because of the high cost in power for many forms of speculation.

Patt: If the biggest problem today is the memory wall, how do we make this bottleneck in the code we want to run still perform well? Again, this is an argument for faster uniprocessors. There will always still be issues with the memory wall, interconnect, and other areas that get in the way of single-threaded performance. The power wall has influenced our thinking and has also helped us to cop out, because architects reason that putting a number of identical low-power cores on the same chip solves the power wall problem.

Hill: I agree that just doubling the number of cores is a cop-out.

Data parallelism?

Audience member: Mark [Hill] has seemed to downplay the potential for data parallelism on the hardware level. It is much simpler to exploit data parallelism from a workload than thread-level parallelism, especially when dealing with a large number of cores (about 1,000) on a chip.

Emer: This kind of parallelism is much more power-efficient than multiple-core parallelism, because one can save all the bookkeeping overhead.

Hill: We need to have data parallelism (not SIMD), but I do not believe that data parallelism is as simple as doing identical operations in the same place, due to conditionals, encryption/decryption, and so on. It's not the same at the lowest level. I'm not convinced that data parallelism is sufficient at the lowest operation level, but it is needed in the programming model.

Vectors are a cool solution, but they haven't seemed to generalize beyond a few classes of applications these past 35 years. The difference between vector processing and multicore processing is that we may have no alternative with the latter. The reason for this is that the uniprocessors will get faster through creative efforts on the part of people like Yale [Patt], but not at a rate of 50 percent a year. We need not only additional cores, but also additional mechanisms to use them, and that is what computer architects have to invent. Simply stamping out the cores may be okay for server land, but not for clients.
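Hill's distinction can be sketched with NumPy (assuming NumPy is available; the arrays and operations are illustrative): strict SIMD applies one operation to every lane, while realistic data-parallel code branches per element, which SIMD hardware must emulate by computing both sides and masking:

```python
# Strict SIMD vs. per-element divergence (illustrative sketch using NumPy).
import numpy as np

x = np.array([1.0, -2.0, 3.0, -4.0])

# Strict SIMD: the identical operation on every lane.
doubled = 2.0 * x

# Divergent data parallelism: each element takes a different path.
# A SIMD machine emulates this by evaluating BOTH branches on all lanes
# and selecting per lane with a mask -- work that a multicore thread,
# free to branch independently, would not have to do.
mask = x > 0
both_sides = np.where(mask, np.sqrt(np.abs(x)), x * x)  # then: sqrt, else: square

print(doubled, both_sides)
```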

Patt: I'm not interested in server land, although that's what made SMT (simultaneous multithreading). If you have lots of different jobs, why not just have simpler chips and let this job run here and that job run there?

Emer: What does SMT have to do with that?

Patt: SMT was a solution looking for a problem, and the problem it found was servers.

Hill: I completely disagree with Yale. Multiple threads are a way to hide memory latency.

Patt: Using SMT to hide memory latency is application dependent, and not always a solution to the memory wall problem. Thus, we come back to Amdahl's law, which we need to continue to address.

Emer: That's what SMT allowed you to do!

The relationship between locality and parallelism

Audience member: You said that locality is less of a problem than parallelism. I would argue that localization is parallelism. If you don't have independent, local tasks, then you can't have parallelism; both are strongly intertwined. Looking at parallel applications, they did a good job except when parallelism and locality were in conflict, resulting in too much communication.

Hill: Multicore does change things. For example, multicore has much better on-chip, thread-level communication than was previously possible. But multicore chips do have a different type of locality, in that the union of the cores' working sets should not thrash the chip's last-level cache. I would like programmers to manage locality, but it seems very difficult to do so.
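A quick calculation with illustrative sizes, not figures from the discussion, makes the shared-cache point concrete: per-core working sets that fit easily on a uniprocessor can, in union, thrash a multicore's last-level cache:

```python
# Why the union of per-core working sets matters (illustrative sizes only).
LLC_MB = 16                    # assumed shared last-level cache capacity
WORKING_SET_MB = 3             # assumed per-thread working set

for cores in (1, 2, 4, 8, 16):
    demand = cores * WORKING_SET_MB
    status = "fits" if demand <= LLC_MB else "thrashes"
    print(f"{cores:>2} cores need {demand:>3} MB vs {LLC_MB} MB LLC -> {status}")
# One thread fits easily; at 8 cores the same per-thread footprint thrashes,
# which is why locality becomes a shared, chip-level property on multicore.
```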

Audience member: I agree that it's hard, but if you don't, it doesn't work.

Emer: Yale, [the audience member] is supporting you.

    Patt: Then I should just sit quietly.

Where should industry put its resources?

Audience member: The business side simply wants applications to run faster. Where should computer manufacturers put their resources? What should they produce?

Emer: This is the fundamental question: how do we allocate our resources as chip designers?

Patt: I would produce Niagara X, Pentium Y. I'd worry about lots of mickey-mouse cores and a few heavy-duty cores with a good interconnection network so you don't have to suffer when you move between cores. Some would argue whether the memory model on the chip is correct, but I think that's the chip of the future.

Hill: I agree with Yale that a multiprocessor like his Niagara X, Pentium Y is a good idea. I would also put effort into hardware mechanisms, compiler algorithms, and programming-model changes that can make it easier to program these machines.

Patt: I agree with that.

    References

    1. R. Chappell, private communication, 2003.

2. B. Herrick, "Design Challenges in Multi-GHz Microprocessors," presentation, 2000 Asia South Pacific Design Automation Conf. (ASP-DAC 00); http://www.aspdac.com/2000/eng/ap/herrick2.pdf.

Joel Emer is an Intel Fellow and director of microarchitecture research at Intel, where he leads the VSSAD group. He also teaches part-time at the Massachusetts Institute of Technology. His current research interests include performance-modeling frameworks, parallel and multithreaded processors, cache organization, processor pipeline organization, and processor reliability. Emer has a PhD in electrical engineering from the University of Illinois. He is an ACM Fellow and a Fellow of the IEEE.

Mark D. Hill is a professor in both the Computer Sciences and Electrical and Computer Engineering Departments at the University of Wisconsin–Madison, where he also coleads the Wisconsin Multifacet project with David Wood. His research interests include parallel computer system design, memory system design, computer simulation, and transactional memory. Hill has a PhD in computer science from the University of California, Berkeley. He is a Fellow of the IEEE and the ACM.

Yale N. Patt is the Ernest Cockrell Jr. Centennial Chair in Engineering at the University of Texas at Austin. His research interests focus on harnessing the expected benefits of future process technology to create more effective microarchitectures for future microprocessors. He is a Fellow of the IEEE and the ACM.

Joshua J. Yi is a performance analyst at Freescale Semiconductor in Austin, Texas. His research interests include high-performance computer architecture, simulation, low-power design, and reliable computing. Yi has a PhD in electrical engineering from the University of Minnesota, Minneapolis. He is a member of the IEEE and the IEEE Computer Society.

Derek Chiou is an assistant professor in the Electrical and Computer Engineering Department at the University of Texas at Austin. His research interests include computer system simulation, computer architecture, parallel computer architecture, and Internet router architecture. Chiou has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a senior member of the IEEE and a member of the ACM.

Resit Sendag is an assistant professor in the Department of Electrical and Computer Engineering at the University of Rhode Island, Kingston. His research interests include high-performance computer architecture, memory systems performance issues, and parallel computing. Sendag has a PhD in electrical and computer engineering from the University of Minnesota, Minneapolis. He is a member of the IEEE and the IEEE Computer Society.

Direct questions and comments about this article to Joel Emer, [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
