
•  Despite the proliferation of multi-core, multi-threaded systems, single-thread performance is still an important processor design goal
•  Modern programs do not lack instruction-level parallelism (ILP)
•  Real challenge: exploit implicit parallelism without undue cost
•  One effective approach: decoupled look-ahead

Motivation

Baseline Decoupled Look-ahead Architecture

•  The look-ahead binary (skeleton) offers more parallelism because certain dependences are removed when the skeleton is sliced from the original binary
•  Look-ahead is more error-tolerant due to the lack of a correctness constraint
•  Occasional dependence violations can be ignored
•  Little to no support is needed, unlike in conventional TLS

•  Not all instructions are equally important to the final outcome; a typical program contains plenty of weak instructions
•  Weak instructions can be removed safely from the look-ahead thread to speed up the look-ahead agent without degrading its quality (see the sketch below)
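Below is a minimal C++ sketch of how a skeleton might be carved out of the program binary: keep branches and delinquent loads plus their backward slices, and drop everything else, including instructions judged weak. The data structures and the straight-line (control-flow-free) slicing are illustrative simplifications, not the ALTO-based tool actually used in this work.

```cpp
// Hypothetical sketch of skeleton construction by backward slicing.
// "Inst" and its fields are simplified stand-ins; the real tool works on
// whole binaries and handles control flow, memory dependences, etc.
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Inst {
    std::string opcode;
    std::set<int> srcRegs;       // registers read
    std::set<int> dstRegs;       // registers written
    bool isBranch = false;       // branches feed branch-outcome hints
    bool isMissingLoad = false;  // loads profiled as frequent cache misses
    bool isWeak = false;         // marked weak by profiling or the GA search
};

// Keep branches and delinquent loads plus their backward slices; everything
// else (including weak instructions) is left out of the skeleton.
std::vector<Inst> buildSkeleton(const std::vector<Inst>& binary) {
    std::vector<bool> keep(binary.size(), false);
    std::set<int> liveRegs;  // registers whose producers must be kept

    // Walk backwards so each consumer is seen before its producers.
    for (int i = static_cast<int>(binary.size()) - 1; i >= 0; --i) {
        const Inst& in = binary[i];
        bool needed = in.isBranch || in.isMissingLoad;
        for (int r : in.dstRegs)
            if (liveRegs.count(r)) needed = true;

        if (needed && !in.isWeak) {
            keep[i] = true;
            for (int r : in.dstRegs) liveRegs.erase(r);
            for (int r : in.srcRegs) liveRegs.insert(r);
        }
    }

    std::vector<Inst> skeleton;
    for (std::size_t i = 0; i < binary.size(); ++i)
        if (keep[i]) skeleton.push_back(binary[i]);
    return skeleton;  // removed slots become NOPs or are skipped at run time
}
```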

Look-ahead Acceleration via Weak Dependence Removal

Performance Benefits of Self-tuned Look-ahead
•  Speedup over baseline decoupled look-ahead: 1.16x
•  Speedup over single-thread baseline: 1.78x

Summary and Insights
•  Decoupled look-ahead can uncover significant implicit parallelism
•  The look-ahead thread often becomes a new bottleneck
•  Fortunately, look-ahead lends itself to various optimizations because it has no hard correctness constraints
•  Speculative parallelization is more beneficial in the look-ahead thread than in the main program thread because of its increased parallelism
•  Weak instructions can be removed without affecting look-ahead quality
•  Intelligent look-ahead is a promising solution in an era of flat frequency and modest microarchitecture scaling

References
[1] A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int'l Symp. on Microarchitecture, Nov. 2008.
[2] A. Garg, R. Parihar, and M. Huang. Speculative Parallelization in Decoupled Look-ahead. In Proc. Int'l Conf. on Parallel Architectures and Compilation Techniques, Oct. 2011.
[3] R. Parihar and M. Huang. Accelerating Decoupled Look-ahead via Weak Dependence Removal. Submitted, May 2013.

•  The look-ahead thread (skeleton) runs on a separate core and maintains its memory image in the local L1, with no writeback to the shared L2
•  The look-ahead thread sends execution-based branch outcome hints through a FIFO queue and also helps prefetching into the shared L2 cache (see the sketch below)
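A rough software sketch of that branch-outcome FIFO (the `BranchQueue` class and its methods are hypothetical names, not an actual hardware interface): the look-ahead core pushes resolved branch directions, the main core pops them as prediction hints, and a full queue stalls the look-ahead thread, which is the natural throttling noted under the practical advantages.

```cpp
// Sketch of the branch-outcome FIFO between the look-ahead and main cores.
// In hardware this would be a small ring buffer; std::deque keeps the
// illustration short.
#include <cstddef>
#include <deque>
#include <optional>

class BranchQueue {
public:
    explicit BranchQueue(std::size_t capacity) : capacity_(capacity) {}

    // Look-ahead core: deposit an executed branch outcome. Returning false
    // models natural throttling: when the queue is full, the look-ahead
    // thread stalls instead of running arbitrarily far ahead.
    bool push(bool taken) {
        if (fifo_.size() >= capacity_) return false;
        fifo_.push_back(taken);
        return true;
    }

    // Main core: consume the next outcome as a branch prediction hint.
    // An empty queue means the main thread has caught up and must fall back
    // to its conventional branch predictor.
    std::optional<bool> pop() {
        if (fifo_.empty()) return std::nullopt;
        bool taken = fifo_.front();
        fifo_.pop_front();
        return taken;
    }

private:
    std::size_t capacity_;
    std::deque<bool> fifo_;
};
```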

Figure: For the applications in the right half, the slower look-ahead thread is the bottleneck that slows down the overall decoupled look-ahead system; the number on top of each bar is the potential speedup achievable by speeding up the slow look-ahead thread

Figure: Speedup of baseline look-ahead and speculatively parallel look-ahead over single-thread baseline

•  The look-ahead thread is a self-reliant entity, independent of the main thread, and imposes little management overhead on the main thread
•  No need for quick spawning or register communication support
•  Natural throttling prevents runaway prefetching and cache pollution

Figure: Baseline Decoupled Look-ahead System. The main core executes the main thread and the look-ahead core executes the look-ahead thread; each core has a private L1 and both share the L2. A branch queue carries branch predictions from the look-ahead core to the main core, prefetching hints warm the shared cache, and register state synchronization flows between the cores.

Experimental Setup

Look-ahead Thread: A New Bottleneck

Practical Advantages of Decoupled Look-ahead

Raj Parihar, Michael C. Huang   {parihar@ece, michael.huang@}rochester.edu
Advanced Computer Architecture Laboratory (ACAL), University of Rochester, Rochester, NY 14627

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Figure: Hybrid Genetic Algorithm Framework. A genetic algorithm program (binary parser, fitness evaluator, initial seeds) runs on a local workstation (e.g., Acalsrv) and launches fitness tests on a high-end server (e.g., Bluehive) over Secure Shell (SSH: scp, qsub, etc.); fitness scores are collected, genes are removed, the skeleton is fed back, and the HGA is notified.

Look-ahead Acceleration via Speculative Parallelization

Acknowledgements
NSF (Grant #CCF-0747324), NSFC (Grant #61028004), Alok Garg

Self-tuned and Speculatively Parallel Look-ahead
•  In some cases, the self-tuned and speculatively parallel look-ahead techniques are synergistic (ammp, art)
•  Self-tuned + speculatively parallel look-ahead speedup: 1.84x over the single-thread baseline; 1.20x over the decoupled look-ahead baseline

Genetic Algorithm Based Framework
•  A genetic algorithm based framework can be used reliably to identify and eliminate weak instructions from the look-ahead skeleton
•  Chromosome creation -> crossover and mutation -> natural selection (see the sketch below)
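A condensed C++ sketch of that genetic search, in the spirit of the supervisor program mentioned in the setup section; `Chromosome`, `evaluateFitness`, and all parameters are illustrative, and the placeholder fitness function stands in for the real one, which simulates the pruned skeleton. Each gene corresponds to one candidate instruction, and a set bit means "remove it from the skeleton".

```cpp
// Illustrative genetic-algorithm loop for choosing which instructions to
// drop from the look-ahead skeleton. Names and parameters are hypothetical.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

using Chromosome = std::vector<bool>;  // one bit per removable instruction

// Placeholder: the real fitness test runs a simulation of the pruned
// skeleton and scores look-ahead speed together with hint quality.
double evaluateFitness(const Chromosome& c) {
    return static_cast<double>(std::count(c.begin(), c.end(), true));
}

// Assumes an even popSize >= 4.
Chromosome evolve(std::size_t genes, std::size_t popSize, int generations) {
    std::mt19937 rng(42);
    std::bernoulli_distribution seedBit(0.1), mutateBit(0.01);
    std::uniform_int_distribution<std::size_t> pick(0, popSize / 2 - 1);
    auto fitter = [](const Chromosome& a, const Chromosome& b) {
        return evaluateFitness(a) > evaluateFitness(b);
    };

    // Initial seeds: start from mostly intact skeletons.
    std::vector<Chromosome> pop(popSize, Chromosome(genes, false));
    for (auto& c : pop)
        for (std::size_t g = 0; g < genes; ++g) c[g] = seedBit(rng);

    for (int gen = 0; gen < generations; ++gen) {
        // Natural selection: keep the fitter half of the population.
        std::sort(pop.begin(), pop.end(), fitter);
        pop.resize(popSize / 2);

        // Crossover and mutation refill the population from the survivors.
        while (pop.size() < popSize) {
            Chromosome p1 = pop[pick(rng)], p2 = pop[pick(rng)];
            Chromosome child(genes);
            for (std::size_t g = 0; g < genes; ++g) {
                child[g] = (g % 2 == 0 ? p1[g] : p2[g]);
                if (mutateBit(rng)) child[g] = !child[g];
            }
            pop.push_back(child);
        }
    }
    std::sort(pop.begin(), pop.end(), fitter);
    return pop.front();  // best removal mask found
}
```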

•  Program/binary and dependence analysis tool: based on ALTO
•  Simulator: based on a heavily modified SimpleScalar with look-ahead support
•  Genetic algorithm framework: a supervisor program written in C/C++

Software, Hardware and Runtime Support

Experimental Analysis and Results
•  Speedup of decoupled look-ahead over the single-thread baseline: 1.53x
•  Speedup of speculatively parallel look-ahead over the single-thread baseline: 1.73x

Figure: ILP limit study of SPEC 2000 INT applications for various instruction window sizes; the left three bars measure ILP in an ideal system, whereas the right three bars measure it in the presence of realistic branch mispredictions and cache misses

Figure: Speedup of conventional TLS over the single-thread baseline, and speedup of speculatively parallel look-ahead over decoupled look-ahead; speculatively parallel look-ahead achieves a higher speedup even over a more aggressive baseline

Figure: Long-distance parallelism present in the skeleton

•  Software support: coarse-grain dependence analysis, finding of target and spawn points, exploitation of loop-level parallelism

•  Hardware support: spawning support for a new thread, value communication through registers, partial cache versioning

•  Runtime support: squash the spawned thread if a dependence violation occurs (see the sketch below)
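A simplified sketch of that runtime check, using illustrative structures rather than the actual hardware interface: the spawned look-ahead thread records the cache lines it has read speculatively, and a store from the earlier thread to one of those lines triggers a squash. Because only look-ahead quality is at stake, the squash simply discards the buffered state; no architectural repair is needed.

```cpp
// Sketch of dependence-violation detection for a speculatively spawned
// look-ahead thread. Field and method names are hypothetical.
#include <cstdint>
#include <unordered_set>

struct SpeculativeThread {
    std::unordered_set<uint64_t> readSet;  // cache lines read speculatively
    bool squashed = false;

    static uint64_t lineOf(uint64_t addr) { return addr >> 6; }  // 64B lines

    // Record every speculative load so later stores can be checked against it.
    void onSpeculativeLoad(uint64_t addr) { readSet.insert(lineOf(addr)); }

    // Called when the earlier (non-speculative) portion commits a store.
    void onEarlierStore(uint64_t addr) {
        if (readSet.count(lineOf(addr)))
            squash();  // the spawned thread read this line too early
    }

    void squash() {
        // Discard partial cache versions and buffered results; since this is
        // only the look-ahead thread, no architectural state needs repair.
        readSet.clear();
        squashed = true;
    }
};
```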

Figure: Available parallelism for a 2-core/context system

Figure: Examples of weak instructions (dark) in the application vpr
Figure: Distributions of weak and strong instructions

Weak Dependences: Opportunities and Challenges
•  Examples of weak instructions: inconsequential adjustments, load and store instructions that are (mostly) silent, dynamic NOP instructions (see the sketch below)
•  Challenges: weak instructions are context-dependent and hard to identify and combine, much like the game Jenga; they also interact with surrounding instructions
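As a concrete illustration of those categories, here is a profiling-style sketch with hypothetical structures that flags dynamic NOPs and (mostly) silent stores as weak candidates; in practice such heuristics only seed the search, because whether an instruction is truly weak depends on its context and on which other instructions are removed with it.

```cpp
// Profiling-style heuristics for flagging weak-instruction candidates.
// DynInst and the 95% threshold are illustrative, not from the actual tool.
#include <cstdint>

struct DynInst {
    uint64_t resultValue;   // value the instruction produced
    uint64_t destOldValue;  // prior contents of its destination (reg or mem)
    bool isStore;
};

// Dynamic NOP: a register-writing instruction whose result equals what its
// destination already held, so skipping it changes no architectural state.
bool isDynamicNop(const DynInst& d) {
    return !d.isStore && d.resultValue == d.destOldValue;
}

// (Mostly) silent store: the store writes back the value already in memory.
bool isSilentStore(const DynInst& d) {
    return d.isStore && d.resultValue == d.destOldValue;
}

// A static instruction becomes a weak candidate when nearly all of its
// dynamic instances were inconsequential.
bool isWeakCandidate(uint64_t inconsequential, uint64_t total) {
    return total > 0 && inconsequential * 100 >= total * 95;
}
```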

•  Speedup of conventional TLS over the single-thread baseline: 1.07x
•  Speedup of speculatively parallel look-ahead over decoupled look-ahead: 1.13x