Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
ISTANBUL TECHNICAL UNIVERSITY ���� GRADUATE SCHOOL OF SCIENCE ENGINEERING AND TECHNOLOGY
COMPUTATIONAL FLUID DYNAMICS ANALYSIS USING A HIGH-ORDER COMPACT SCHEME ON A GPU
Ph.D. THESIS
Bülent TUTKUN
Department of Aeronautics and Astronautics Engineering
Aeronautics and Astronautics Engineering Graduate Programme
NOVEMBER 2012
ISTANBUL TECHNICAL UNIVERSITY ���� GRADUATE SCHOOL OF SCIENCE ENGINEERING AND TECHNOLOGY
COMPUTATIONAL FLUID DYNAMICS ANALYSIS USING A HIGH-ORDER COMPACT SCHEME ON A GPU
Ph.D. THESIS
Bülent TUTKUN (511032001)
Department of Aeronautics and Astronautics Engineering
Aeronautics and Astronautics Engineering Graduate Programme
Thesis Advisor: Prof. Dr. Fırat Oğuz EDİS
NOVEMBER 2012
İSTANBUL TEKN İK ÜNİVERSİTESİ ���� FEN BİLİMLER İ ENSTİTÜSÜ
GPU ÜZERİNDE YÜKSEK-MERTEBE KOMPAKT ŞEMA KULLANILARAK HESAPLAMALI AKI ŞKANLAR D İNAM İĞİ ANAL İZİ
DOKTORA TEZ İ
Bülent TUTKUN (511032001)
Uçak ve Uzay Mühendisliği Anabilim Dalı
Uçak ve Uzay Mühendisliği Programı
Tez Danışmanı: Prof. Dr. Fırat Oğuz EDİS
KASIM 2012
v
Bulent Tutkun , a Ph.D. student of ITU Graduate School of Science, Engineering and Technology student ID 511032001, successfully defended the dissertation entitled “COMPUTATIONAL FLUID DYNAMICS ANALYSIS US ING A HIGH-ORDER COMPACT SCHEME ON A GPU ”, which he prepared after fulfilling the requirements specified in the associated legislations, before the jury whose signatures are below.
Thesis Advisor : Prof. Dr. Fırat Oğuz EDİS .............................. İstanbul Technical University
vii
FOREWORD
First and foremost I would like to offer my sincerest gratitude to my thesis supervisor Prof. Fırat Oğuz Edis who has supported me throughout this study with his invaluable guidance and experience. In addition, I thank all jury members for their directive contributions. I would like to express my great appreciations for Evren Öner for his help whenever I need it. Without him, it would be very formidable for me to tackle some hardware and software issues. I am also grateful to Asst. Prof. Bayram Çelik for his constructive comments on some parts of this dissertation. I wish to thank to my dearest family members starting from my parents Osman and Nazmiye Tutkun; my elder sister Tülay Tutkun for their heartfelt, continuous support. Finally, a special thanks goes to my beloved wife Gülcan for her patience, enthusiasm and love throughout this study. Also, I should mention that required financial support to get the computer system needed for this study is provided by the Istanbul Technical University support programme for postgraduate researches.
April 2012 Bülent TUTKUN
ix
TABLE OF CONTENTS
Page
FOREWORD ............................................................................................................ vii ABBREVIATIONS ................................................................................................... xi LIST OF TABLES .................................................................................................. xiii LIST OF FIGURES ................................................................................................. xv SUMMARY ............................................................................................................ xvii ÖZET ........................................................................................................................ xix 1. INTRODUCTION .............................................................................................. 1
2. GOVERNING EQUATIONS ............................................................................ 7
2.1 2D Advection Equations and 3D Compressible Navier-Stokes Equations ... 7
2.2 3D Incompressible Navier-Stokes Equations ................................................ 9
3. NUMERICAL METHODS .............................................................................. 11
3.1 Temporal Discritization: Runge-Kutta and Adams-Bashforth Method ...... 11 3.2 High-Order Central Compact Finite Difference Scheme ............................ 12
3.3 High-Order Low-Pass Filtering Scheme ..................................................... 23
3.4 Immersed Boundary Method ....................................................................... 26
3.4.1 Continuous Forcing Approach ............................................................. 26
3.4.2 Direct Forcing Approach...................................................................... 28
3.5 Conjugate Gradient Method ........................................................................ 31
4. GENERAL-PURPOSE GPU (GPGPU) COMPUTING ............................... 37 4.1 GPU Computing in CFD ............................................................................. 38
5. GPU ARCHITECTURE AND CODE DESIGN ............................................ 41 5.1 Tesla C1060 and CUDA .............................................................................. 41
5.2 Code Design ................................................................................................ 43 6. APPLICATIONS AND PERFORMANCE COMPARISONS ..................... 49
6.1 Advection of a Vortical Disturbance ........................................................... 50
6.2 Temporal Mixing Layer .............................................................................. 53
6.3 IBM Test Case - Flow Around a Square Cylinder ...................................... 58
7. CONCLUSIONS ............................................................................................... 75 7.1 Future Works ............................................................................................... 77
REFERENCES ......................................................................................................... 79 APPENDICES .......................................................................................................... 87
APPENDIX A: Fourier Error Analysis of Some Central Finite Difference Schemes for First and Second Derivatives ............................................................................ 87
CURRICULUM VITAE .......................................................................................... 91
xi
ABBREVIATIONS
BLAS : Basic Linear Algebra Subroutines CFD : Computational Fluid Dynamics CG : Conjugate Gradient COO : Coordinate Format CPU : Central Processing Unit CSR : Compressed Sparse Row Format CUBLAS : Basic Linear Algebra Subroutines for CUDA CUDA : Compute Unified Device Architecture CUSPARSE : Sparse Matrix Library for CUDA DNS : Direct Numerical Simulation FFT : Fast Fourier Transform GFLOPS : Giga Floating-Point Operations per Second GPGPU : General-Purpose Computing on Graphical Processing Units GPU : Graphical Processing Unit IBM : Immersed Boundary Method IFORT : Intel Fortran Compiler LAPACK : Linear Algebra Package LES : Large Eddy Simulation RHS : Right Hand Side SpMV : Sparse Matrix-Vector Multiplication
xiii
LIST OF TABLES
Page
Table 3.1: Coefficients of Runge-Kutta scheme. ...................................................... 11
Table 3.2: Relations between the coefficients for the first derivative. ...................... 13
Table 3.3: Relations between the coefficients for the second derivative. ................. 15
Table 3.4: Resolution indications of various schemes. ............................................. 19
Table 3.5: Filtering coefficients for various schemes. .............................................. 24
Table 5.1: General specifications of Tesla C1060 [84] ............................................. 41
Table 5.2: Performance gain obtained from the improvements in memory access. . 47
Table 6.1: Absolute errors for the GPU calculations. ............................................... 51
Table 6.2: Required Gflop per time step. .................................................................. 55
Table 6.3: Dimensions of the flow domain. .............................................................. 59
Table 6.4: First node distance from the body. ........................................................... 60
Table 6.5: Comparison of Strouhal numbers for �� = 100. .................................... 67
Table 6.6: Comparison for the variation of Strouhal number with Re ...................... 67
Table 6.7: Comparison of Clrms and Cdave ................................................................. 69
xv
LIST OF FIGURES
Page
Figure 3.1 : Resolution characteristics of central schemes for first derivative. ...... 18 Figure 3.2 : Resolution characteristics of central schemes for second derivative. . 21
Figure 3.3 : Dissipation characteristics for different filters (�� = �. ). ............... 25 Figure 3.4 : Dissipation characteristics of tenth-order filtering. ............................. 25
Figure 3.5 : An arbitrary body immersed in a Cartesian mesh. .............................. 26
Figure 3.6 : Vicinity of the immersed boundary. .................................................... 31
Figure 3.7 : Conjugate gradient method algorithm. ................................................ 32
Figure 3.8 : COO format for any sparse matrix A. ................................................. 33
Figure 3.9 : CSR format for any sparse matrix A. .................................................. 33
Figure 3.10 : Format of ELL for a general sparse matrix A. .................................... 34
Figure 5.1 : A Tesla C1060 GPU ............................................................................ 41
Figure 5.2 : General structure of Tesla C1060 ........................................................ 42
Figure 5.3 : A representation of CUDA structure ................................................... 43
Figure 5.4 : Representation of the CUDA grid. ...................................................... 44
Figure 5.5 : Computation structure for a 3D flow domain...................................... 44
Figure 5.6 : Ideal coalesced access pattern. ............................................................ 46
Figure 5.7 : Non-coalesced access pattern of the code. .......................................... 46
Figure 6.1 : Flow domain. ....................................................................................... 50
Figure 6.2 : Comparison of computed vorticity magnitude with exact solution. ... 51 Figure 6.3 : Performance comparison, calculation time for one time step. ............ 52
Figure 6.4 : Performance comparison, speed up values. ......................................... 52
Figure 6.5 : Performance comparison, time taken per node. .................................. 53
Figure 6.6 : Mixing layer flow domain. .................................................................. 54
Figure 6.7 : Evolution of eddies. ............................................................................. 55
Figure 6.8 : Performance comparison, calculation time for one time step. ............ 56
Figure 6.9 : Performance comparison, speed up values. ......................................... 56
Figure 6.10 : Performance comparison, time taken per node. .................................. 57
Figure 6.11 : Timing breakdown of one time step for 1283 mesh on GPU. .............. 57
Figure 6.12 : Immersed boundary flow domain and boundary conditions. .............. 59 Figure 6.13 : a) Instantaneous streamlines, b) � velocity contours .......................... 61
Figure 6.14 : Instantaneous streamlines in the vicinity of square cylinder for a full aavortex shedding cycle, Re = 100 (above: Present study, below: aSharma aand Eswaran [99]). ............................................................... 62
Figure 6.15 : U velocity and pressure contours for a full vortex shedding cycle ..... 64
Figure 6.16 : Comparison of Lr. (Yoon et al. [104], Sharma and Eswaran [99], aRobichaux et al. [98]) .......................................................................... 65
Figure 6.17 : Power spectrum of Cl (left) and Cd (right) for Re = 100. .................... 66
Figure 6.18 : Variation of Strouhal number with Reynolds number. Mentioned astudies: Sen et al. [102], Singh et al. [101], Sharma and Eswaran [99], aSaha et al. [97], Sohankar et al. [105], Robichaux et al. [98], Sohankar aet al. [96], Franke et al. [106], Davis et al. [107]. ............................... 68
xvi
Figure 6.19 : RMS of lift coefficients for the various Reynolds numbers. ............... 69
Figure 6.20 : Time averaged drag coefficients for the various Reynolds numbers. . 70
Figure 6.21 : Timing breakdown of one time step. ................................................... 70
Figure 6.22 : Timing breakdown of CG iterative solver. .......................................... 71
Figure 6.23 : SpMV performance of CUSPARSE library. ....................................... 72
Figure 6.24 : Performance comparison of CUSPARSE and ELLPACK. ................. 73 Figure 6.25 : Time comparison for the solution of sparse linear systems. ................ 74
Figure 6.26 : Timing breakdown of incompressible solver and CG iterative solver aafter the use of ELLPACK form. ........................................................ 74
xvii
COMPUTATIONAL FLUID DYNAMICS ANALYSIS USING A HIGH-ORDER COMPACT SCHEME ON A GPU
SUMMARY
Computational fluid dynamics (CFD) has made a great progress in the past several decades in attaining the satisfactory simulation results of practical and engineering flow problems. Although, the practical problems involving complex structures or high Reynolds number turbulence still constitute a major challenge, applications of direct numerical simulation (DNS) and large eddy simulation (LES) to various formidable turbulent flowfields present some promising results. However, DNS or LES calculations are obliged to tackle high memory requirements and heavy computational burdens, as a result of this, users prefer generally less accurate turbulence models to DNS or LES. But most of the practical fluid mechanics problems involve a broad spectrum of length and time scale. Therefore, hardware, software and numerical methods should be thought in combination in order to obtain the solutions of these flows with acceptable accuracy and performance. The main purpose of this study is actually to form such a combination for CFD applications. That is, numerical methods capable of representing all or most of the relevant scales are needed for the simulation of these flows. The length scales resolved by a computation are determined by the spatial resolution; moreover, the accuracy in the representation of these scales depends on the numerical scheme. One way to tackle this problem is the use of a high-order method. Therefore, in this study, a sixth-order compact finite difference scheme is exploited for the space discretization in order to obtain the flow field simulations of three test cases. These test cases are the advection of a vortical disturbance (2D advection equations), temporal mixing layer (3D compressible flow equations) and flow around a square cylinder (3D incompressible flow equations). All computations are performed on a GPU produced for the scientific computing purposes. Used GPU is NVIDIA Tesla C1060 that has 30 streaming multi-processors, each containing 8 scalar processors, clocked at 1.3 GHz. It also has a 4 GB GDDR3 global memory at 102 GB/s. In addition to sixth-order compact scheme, a tenth-order low pass filtering scheme is also applied to the solution vector to supress the spurious oscillations for 2D advection equations and 3D compressible equations. Moreover for the time integration, a fourth-order Runge-Kutta scheme is exploited for the advection and compressible test cases, whereas incompressible test case benefits from a second-order Adams-Bashforth scheme for time advancement. Another important numerical method applied in the present study is immersed boundary method used in the test case involving flow around a square cylinder. In the most basic terms, this method allows to perform the computations on a uniform Cartesian mesh by inserting a forcing term into the momentum equation to mimic the solid body. Performance of the GPU computations in terms of the calculation time for sample problems are compared with the computation performance of a CPU. Used CPU for the comparisons is AMD Phenom 2.5 GHz.
xviii
The codes running on the GPU are written with a C-based programming language. For this, Compute Unified Device Architecture (CUDA) toolkit which is a complete software development solution for programming CUDA-enabled GPUs is utilized as the code design medium. In the CPU side, code is written with Fortran programming language.
Sixth-order compact scheme and tenth-order filtering scheme generate tridiagonal in general or cyclic tridiagonal if the boundary conditions are periodic. For the solution of these linear systems LAPACK/BLAS library is utilized for the CPU code, while the GPU code exploits the inverse of the coefficient matrices for the solution of the same linear systems by utilizing the CUBLAS library. CUDA thread structure is arranged in a way that each node in the computational mesh is represented by one thread. Moreover, thread blocks are aligned with the x Cartesian axis, so every thread representing the nodes on the same horizontal mesh line takes place on the same thread block.
Coalesced access to the global memory is an important factor affecting the performance of the computation on GPU. If there are uncoalesced reads or writes in the computation process, it may severely degrade the computation performance. In the present study, coalescing appeared as an issue in some parts of the code especially in the kernels related to x-derivative calculations. Shared memory property of the GPU is utilized in order to alleviate this degradation in the performance. Shared memory which is an on-chip memory and much faster than the global memory is accessible for all the threads in the same thread block. As a result, a performance increase is attained by using the shared memory when required.
In conclusion, with these mentioned implementations, above test cases are solved on a GPU and obtained results are compared with those of a CPU in terms of calculation times. For the 2D advection equations some speedup values up to 10x are achieved for different mesh sizes in comparison to CPU computations. Test case involving the 3D compressible equations requires more computational efforts in comparison with the advection test case and achieved speedup values between 14.5x – 16.5x again for various mesh sizes. In the case of incompressible test problem, a Poisson equation must be solved as a difference from the previous two cases, and it takes nearly 90% of calculation time. Therefore, related to this, a comparison is made for the solution of sparse linear systems between present application and another GPU and CPU implementations in the literature. Therefore, it is seen that GPU perfromance in this study achieves speedup values about 3x and 24x considering other GPU and CPU performances respectively.
xix
GPU ÜZERİNDE YÜKSEK-MERTEBE KOMPAKT ŞEMA KULLANILARAK HESAPLAMALI AKI ŞKANLAR D İNAM İĞİ ANAL İZİ
ÖZET
Hesaplamalı akışkanlar dinamiği, gündelik ve mühendislik akış problemlerinin tatmin edici simülasyonlarına ulaşma konusunda son çeyrek asır içerisinde büyük ilerlemeler kaydetmiştir. Karmaşık yapılar içeren veya yüksek Reynolds sayılı akışlar hala büyük zorluklar çıkarsa da, direkt sayısal benzetim (DNS) ve büyük eddy benzetiminin (LES) çeşitli zorluklar içeren türbülanslı akış alanlarına uygulanması bazı umut verici sonuçlar sunmaktadır. Bununla birlikte, türbülanslı akış analizlerinde DNS ve LES yöntemlerini kullanan araştırmacılar özellikle üç-boyutlu akış alanlarının benzetiminde yüksek bellek gereksinimleri ve ağır hesaplama yükleriyle uğraşmak zorunda kalmaktadır. Bunun sonucu olarak, araştırmacılar genel olarak daha az doğruluk taşıyan modelleri DNS ve LES’e tercih etmektedirler. Bununla beraber, karşılaşılan çoğu pratik mühendislik problemine uzunluk ve zaman ölçekleri açısından bakıldığında, pek çok problemin uzunluk ve zamanın geniş bir spektrumunu içerdiği görülür. Bundan dolayı, bu özelliklere sahip türbülanslı akışların çözümünün kabul edilebilir bir doğrulukla yapılabilmesi için kullanılacak sayısal yöntemin yanısıra, donanım ve yazılım özellikleri de önem kazanmaktadır. Bunun için donanım, yazılım ve sayısal yöntemlerin bir arada düşünülmesi gerekir. Kısacası bu akışların daha yüksek doğrulukla benzetimi için ilgili ölçeklerin tümünü veya çoğunu temsil edebilme kabiliyetine sahip sayısal yöntemler, bu sayısal yöntemlerin kabul edilebilir bir hızla çözülebilmesi için uygun donanımsal bileşenler ve son olarak da bu donanımsal bileşenlere uygun olarak oluşturulmuş yazılımlar gerekmektedir. Bu çalışmanın başlıca amacı, böyle bir kombinasyonu hesaplamalı akışkanlar dinamiği uygulamaları için bir araya getirmektir. Bir hesaplamada temsil edilebilecek uzunluk ölçekleri uzaysal çözünürlük tarafından belirlenir. Ayrıca, bu ölçeklerin temsilindeki doğruluk da kullanılan sayısal şemaya bağlıdır. Bu bağlamda gereken yüksek doğruluğu sağlamanın bir yolu da yüksek mertebeden yöntemlerin kullanılmasıdır. Yüksek mertebeden doğruluk içeren yöntemlerin kullanılması hem oluşturulacak sayısal ağın toplam eleman sayısında azalmaya sebep olacak hem de ulaşılacak sonuçların doğruluğunu yükseltecektir. Bu nedenle, bu çalışmada, üç test probleminin akış alanı benzetimlerini elde etmek amacıyla, uzaysal ayrıklaştırma için bir kompakt sonlu fark şeması kullanılmıştır. Bu problemler, bir girdapsal bozuntunun taşınımı (iki boyutlu taşınım denklemleri), zamansal karışma tabakası (üç boyutlu sıkıştırılabilir akış denklemleri) ve kare silindir etrafında akıştır (üç boyutlu sıkıştırılamaz akış denklemleri). Yüksek mertebeden kompakt şemalar sahip oldukları özellikler sayesinde hesaplamalı akışkanlar dinamiği alanında oldukça popüler hale gelmişdir. Kompakt sonlu fark şemaları uzunluk ölçeği olarak geniş bir spektruma sahip problemlerde standart sonlu fark şemalarına kıyasla daha iyi bir çözümleme sağlarlar. Ayrıca, kompakt şemalar daha küçük şema açıklıkları gerektirmekte ve özellikle yüksek dalga numaralarında daha iyi bir resolüsyon sağlamaktadır. Bu sayede, kompakt şemalar sonlu fark şemalarının iyi özellikleri ile spektral çözüme benzer bir doğruluğu etkili bir biçimde birleştirirler. Bu özellikler
xx
dikkate alınarak, bu çalışmada akış alanlarının analizinde uzaysal ayrıklaştırma için altıncı-merteben kompakt merkezi sonlu fark şeması kullanılmıştır. Bununla birlikte, merkezi şemalar doğal yapılarında herhangi bir disipasyon taşımadıklarından, sıkıştırılabilir Navier-Stokes denklemlerindeki taşınım terimleri gibi non-linear terimlerin benzetiminde muhtemel non-linear kararsızlıklar ortaya çıkabilir. Bundan dolayı, ortaya çıkabilecek bu türden kararsızlıkları bastırabilmek için kompakt şema ile beraber sıkıştırılabilir akış analizlerinde yüksek-mertebeden alçak geçiren bir filtreleme şeması da kullanılmıştır. Bu yaklaşım, orjinal denklemlerin değiştirilmesini gerektiren yapay disipasyon ekleme yönteminden farklı olarak, bir işlem sonrası yöntemi olarak görülebilir. Bu yaklaşımda filtreleme herbir zaman adımından sonra ulaşılan çözüm vektörüne uygulanmaktadır. Diğer bir deyişle, bu çalışmada kullanılan onuncu-mertebe filtreleme şeması korunumlu değişkenler üzerine koordinat yönleri boyunca art arda uygulanmaktadır.
Bu tez kapsamında kullanılan bir diğer sayısal yöntem ise sıkıştırılamaz akış denklemlerinin çözüldüğü test probleminde kullanılan gömülü sınır yöntemidir. Bu yöntem sınırlardan bağımsız olarak herhangi bir akış alanı için Kartezyen Navier-Stokes çözücülerin kullanılabilmesini sağlayan sayısal bir araçtır. En temel haliyle, bu yaklaşım, momentum denklemine katı cismi taklit etmek amacıyla eklenen bir kuvvet terimi aracılığıyla çözümü üniform kartezyen bir mesh üzerinde yapmamıza olanak tanır. Bu yöntemde oluşturulan sayısal çözüm ağının sınırlara uydurulmasına gerek yoktur. Kartezyen bir grid istenildiği gibi oluşturulur ve sınırlar bu ağın içerisine gömülmüş olarak kabul edilir. İstenilen sınır koşulları, gömülü sınır civarında yönetici denklemlere yapılan müdahaleler yardımıyla sağlanır. Bu sayede karmaşık sınırlar etrafında sayısal çözüm ağı oluşturulması sırasında karşılaşılan zorlukların pek çoğu ortadan kalkar. Bu çalışmada, istenilen sınır koşullarını sağlamak için, gömülü sınırlar civarında momentum denklemlerine gerekli kuvvet terimleri eklenmiş ve sınır koşulları sağlanmıştır.
Ayrıca zaman integrasyonu için, yine ilk iki problem kapsamında dördüncü-mertebeden bir Runge-Kutta şeması uygulanırken, sıkıştırılamaz akış içeren problem için zamansal ilerleme ikinci-mertebeden Adams-Bashforth şeması ile sağlanmıştır.
Daha önce de belirtildiği gibi büyük ölçekli, geniş spektrumlar içeren hesaplamalı akışkanlar dinamiği problemlerinin başarılı bir şekilde çözülmesi sadece kullanılacak uygun sayısal yöntemlere değil aynı zamanda bilgisayar sistemlerinin hesaplama gücündeki gelişmelere de bağlıdır. Bu açıdan bakılınca grafik işleme birimlerinin (GPU) aynı zamanda bilimsel hesaplama birimleri olarak da kullanılmaya başlanması bilgisayar sistemleri açısından önemli bir gelişmedir. GPU’ların hesaplama güçlerindeki sürekli artış, son yıllarda bu birimleri genel olarak hesaplamalı bilimler için ciddi bir alternatif haline sokmuştur. GPU’lar yapıları gereği hızlı paralel görüntü işleme yeteneğine sahip olduklarından, barındırdıkları yüzlerce hesaplama çekirdeği yardımıyla ağır hesaplama yükü içeren mühendislik problemlerinde de başarılı sonuçlar elde etmektedirler. Bu çalışmada, tüm hesaplamalar, bilimsel hesaplamalar için üretilmiş bir GPU üzerinde gerçekleştirilmi ştir. Kullanılan GPU, her biri 8 skalar işlemci içeren 30 adet çoklu-işlemciye sahip olan NVIDIA Tesla C1060 GPU’sudur. Taşıdığı işlemcilerin çalışma frekansları 1.3 GHz olup, 102 GB/sn perfromansında veri iletebilen 4 GB’lık GDDR3 global hafızaya sahiptir. GPU hesaplamalarının performansı örnek problemler için çalışma zamanı açısından bir CPU’nun performansı ile karşılaştırılmıştır. Karşılaştırma için kullanılan CPU AMD Phenom 2.5 GHz’lik CPU’dur.
xxi
Geniş ölçekli problemlerin başarılı bir şekilde çözülmesindeki son faktör oluşturulacak yazılımlardır. Bu çalışmada, problemlerin GPU üzerinde etkili bir şekilde çözülebilmesi için bu donanımsal yapıya ve kullanılan sayısal yöntemlere uygun kodlar geliştirilmi ştir. GPU üzerinde koşan kodlar C-tabanlı bir programlama dili ile yazılmıştır. Bunun için, GPU’ların programlanması için geliştirilen bir ortam olan CUDA (Compute Unified Device Architecture) kullanılmıştır. GPU sonuçlarının karşılaştırıldığı CPU’daki kodlar ise Fortran programlama dili ile yazılmıştır.
Altıncı-mertebe kompakt şema ve onuncu-mertebe filtreleme şeması genel olarak üç-bant katsayı matrisleri üretirler, eğer sınır koşulları periyodik ise katsayı matrisleri periyodik üç-bant olmaktadır. Bu lineer sistemlerin çözümü için CPU kodu LAPACK/BLAS kütüphanesini kullanırken, GPU kodu aynı lineer sistemlerin çözümü için CUBLAS kütüphanesi aracılığı ile katsayı matrislerinin tersini kullanmaktadır. CUDA thread yapısı hesaplama ağındaki herbir düğümün bir thread tarafından temsil edilceği şekilde düzenlenmiştir. Ayrıca thread blokları x Kartezyen ekseni boyunca yer almaktadır, böylece aynı yatay mesh hattı üzerindeki düğümleri temsil eden bütün threadler aynı thread bloğunda bulunmaktadır.
Global belleğe sıralı erişim GPU üzerindeki hesaplamanın performansını etkileyen önemli bir faktördür. Eğer hesaplama sürecinde sıralı olmayan bir şekilde okuma veya yazma erişimi var ise, bu durum performansı ciddi bir şekilde düşürebilir. Bu çalışmada da, GPU kodunun bazı bölümlerinde, özellikle x yönündeki türevlerle alakalı olan kernellerde bazı sıralı olmayan erişim sorunları çıkmıştır. Bu sorunu hafifletmek amacıyla, GPU’nun paylaşımlı bellek özelliği kullanılmıştır. Çip üzerinde olan ve global bellekten çok daha hızlı olan paylaşımlı belleğe aynı thread bloğu içinde yer alan tüm threadler ulaşabilmektedir. Bunun sonucunda, gerektiğinde paylaşımlı bellekten yararlanılarak performansta bir artış sağlanmıştır.
Sonuç olarak, yukarıda bahsedilen test prtoblemleri belirtilen uygulamalar ile bir GPU üzerinde çözülmüş ve elde edilen sonuçlar bir CPU ile hesaplama zamanı açısından karşılaştırılmıştır. İki boyutlu taşınım denklemlerinin çözümünü içeren problem için, farklı sayısal ağ büyüklükleri için CPU’ya kıyasla 10 kata varan hızlanmalar sağlanmıştır. Üç boyutlu sıkıştırılabilir denklemlerin çözümü taşınım problemine göre daha yoğun bir hesaplama yükü taşıdığından, yine farklı ağ büyüklükleri için ulaşılan perfromans artış değeri 14.5-16.5 arasındadır. Sıkıştırılamaz akış problemi durumunda ise, önceki iki test probleminden farklı olarak bir Poisson denklemi çözülmekte ve bu çözüm hesaplama zamanının yaklaşık %90’ını almaktadır. Bu yüzden, bu durumla ilgili olarak, performans karşılaştırması seyrek lineer sistemlerin çözümleri için şimdiki uygulama ve başka GPU ve CPU uygulamaları arasında yapılmıştır. Poisson denkleminin GPU üzerinde çözümü için konjuge gradyan iteratif yöntemi kullanılmıştır. Bu iteratif yöntemde en önemli adım seyrek matris-vektör çarpımının yapıldığı adımdır. Bu adım kapsamında seyrek matrisin depolanması ve kullanılması için iki farklı yaklaşım test edilmiştir. Birinci yaklaşım, seyrek matrisi CSR (compressed sparse row) formatında depolamak ve seyrek matris-vektör çarpımını, CUSPARSE isimli seyrek matris işlemleri için oluşturulmuş olan kütüphaneyi kullanarak gerçekleştirmektir. Kullanılan bir diğer yaklaşım ise seyrek matrisi ELLPACK formatında depolamak ve yazılan bir kod yardımıyla seyrek matris-vektör çarpımını gerçekleştirmektir. Bu iki uygulama sonucunda ELLPACK formatının sağladığı performansın daha yüksek olduğu görülmüştür. Bunun sonucunda, bu çalışmada ulaşılan GPU performansının diğer
xxii
GPU ve CPU performanslarına göre yaklaşık olarak sırasıyla 3 ve 24 kat daha iyi olduğu görülmüştür.
1
1. INTRODUCTION
Computational fluid dynamics (CFD) has made a great progress in the past several
decades in attaining satisfactory simulation results of practical and engineering flow
problems. Although, the practical problems involving complex structures or high
Reynolds number turbulence still constitute a major challenge, application of direct
numerical simulation (DNS) and large eddy simulation (LES) to various formidable
turbulent flowfields present certain promising results. Today, there are numerous
commercial CFD packages utilizing different type of discretizations, grid handling
and so on. However, they mostly exploit second-order accurate schemes at most. In
the case of DNS or LES computations, they are obliged to tackle high memory
requirements and heavy computational burdens. As a result of this, most of the
researchers prefer generally less accurate turbulence models to DNS or LES.
Therefore, hardware, software and numerical methods should be thought in
combination in order to obtain the solutions of complex flows with acceptable
accuracy and performance. In this regard, the advancements in the hardware
technology may give a rise to alternative computational platforms and thereby
affordable solutions in the near future for the users in engineering fields. Parallel to
the advancements in the hardware technologies, higher-order numerical methods can
meet the corresponding needs in the numerics. Thus, reached computational power
related to hardware and suitable numerical methods can generate a competent
combination that facilitates the further achievements.
Turbulent or the convection-dominated flows have a broad spectrum of length and
time scales. Therefore, numerical methods capable of representing all or the most of
the relevant scales are needed for the simulation of such flows. The length scales
resolved by a computation are determined by the spatial resolution and the accuracy
in the representation of the scales depends on the numerical scheme. Using finer
grids with the standard second-order methods can meet these requirements.
However, efficiency of the scheme may be derogated by massive memory demand or
numerical error. Low-order numerical methods have some difficulties in simulating
2
vortex dominated flow fields. The main reasons for that are deformation and
dissipation of the vortical flow structures as a result of numerical diffusion in the
solution algorithm. Application of high-order methods may reduce the grid size and
enhance the accuracy. In this respect, compact finite difference methods [1, 2],
nowadays, have become quite popular in computational fluid dynamics (CFD)
applications.
Compact finite difference schemes are generally better in resolving the broad
spectrum of length scales in comparison with standard finite difference scheme of the
same order. Thus, compact schemes require relatively smaller stencil and provide
better resolution especially at higher wave numbers. Therefore, they can combine the
robustness of finite difference schemes and the accuracy of spectral-like solution in
an efficient way. Comprehensive studies focusing on the resolution characteristics of
the higher order compact schemes on a uniform grid were carried out [1, 3]. Compact
schemes are applied in the simulation of the various incompressible and
compressible flow problems [4-13]. In compact finite difference schemes, derivatives
are computed implicitly. These schemes require utilization of not only function
values but also derivative values at the neighboring nodes.
In the present study, a sixth-order central compact finite difference scheme is used in
the simulations of various compressible and incompressible flows. Because the
central schemes do not contain any dissipation inherently, some nonlinear
instabilities can possibly appear during the simulations of nonlinear terms such as the
convective terms in the compressible Navier-Stokes equations. Therefore, to
suppress these instabilities, a high-order low-pass spatial filtering may be a useful
tool [14]. As being different from the explicitly added artificial dissipation that
modifies the original governing equations, filtering actually can be seen as a post
processing technique. It is applied to the solution vector in order to regularize the
features that are captured but poorly resolved. Filtering is implemented on the
conserved variables successively along each of the coordinate directions. Hence in
this study, a tenth-order low-pass filtering scheme is also utilized for the
computations of advection equations and compressible Navier-Stokes equations.
Immersed boundary method (IBM) is a numerical method that also draws attention
of the researchers in recent years. This method is a tool applied to the solution of
flow problems with any boundaries using a Cartesian grid Navier-Stokes solver. That
3
is, computational grid does not need to conform to the boundaries and desired
boundary conditions are imposed by applying different numerical algorithm in the
vicinity of the immersed boundaries [15, 16]. Therefore, this approach considerably
facilitates the difficulties faced during the grid generation of complex boundaries.
Moreover, computational cost per grid node is generally much lower than that of
general purpose unstructured grid solvers. As a result, because of the offered
simplification in the grid generation and other aspects, CFD community recently
takes an interest in this technique that was originally aimed at biological flows. With
IBM, equations may be solved only on a Cartesian grid and there is no need to
reconstruct the grid in the case of moving or deforming bodies. Even though
immersed boundary method was first proposed and implemented by Peskin [17] to
simulate blood flow interacting with the heart, now it is in use for the solution of
various problems such as moving rigid boundaries [18], flapping airfoils [19],
complex flow and heat transfer [20] and acoustic wave scattering [21]. In his method,
Peskin used Lagrangian points connected by springs to simulate the elastic boundary.
Later, the method was adapted to simulate rigid boundaries by increasing spring
rigidity [22] or using a feedback control [23, 24]. However, these methods had heavy
time-step restrictions due to the stiffness of the system. In order to avoid this
restriction, Mohd-Yusof [25] first introduced direct forcing approach and imposed
the velocity boundary condition directly. Afterwards, this method was implemented
by many researchers [26-34]. In direct forcing IBM, solid body is immersed in the
grid with the use of an added forcing term to the governing equation. Therefore,
velocity boundary condition on the immersed body is satisfied with the help of the
mentioned forcing term which is calculated from the algebraic equations in the
discretized problem. An application of a direct forcing IBM method for the solution
of incompressible Navier-Stokes equations is also presented in this study.
As mentioned before, the increase in the computational power of computer systems
is one of the most prominent factors for the successful applications of large scale
CFD problems. Emergence of graphical processing units (GPU) as a scientific
computing platform in recent years is a good example of those advancements. At the
beginning, they’ve provided service with continuously increasing performance to the
gamers in various simulations regarding the gaming content. The continuous increase
in computational power of GPUs has widened their use from gaming to applications
4
with heavy computational burden in the field of computational sciences. GPUs, with
their many core structures and outstanding processing performance, provide a serious
opportunity for scientific computations. GPUs are specifically designed to be
extremely fast at processing large graphics data sets for rendering tasks. Since these
units have a parallel computation capability inherently, they can provide fast and low
cost solutions to engineering problems with heavy computational burden. Numerous
applications from different engineering fields that exploit parallel performance of
GPUs can be found in literature [35]. In the field of CFD, some promising
implementations are also realized with different flow field simulations. One example
is in [36] which were one of the early applications in compressible fluid flows.
Brandvik and Pullan [37, 38], using GPU computing, carried out 2D and 3D
simulations of Euler equations and made performance comparisons between GPU
and CPU. An hypersonic flow simulation on a GPU platform may be found in the
research article of Elsen et al.[39]. They used an NVIDIA 8800GTX GPU and
simulated a hypersonic vehicle in cruise at Mach 5 to demonstrate the capabilities of
GPU code. They achieved a speedup of 20x for the Euler equations. An
implementation for 3D unstructured solution of an inviscid, compressible flow is
presented in [40]. Apart from these, some recent CFD applications can be seen in
[41-48].
In the context of this thesis study, aforementioned numerical techniques, -high-order
compact scheme, low-pass filtering and immersed boundary method- are applied to
the GPU with the purpose of benefiting from the immense parallel computational
power of the GPU. For this, a Tesla C1060, one of the NVIDIA’s scientific
computing GPUs, is used as the compute device. Compute Unified Device
Architecture (CUDA) toolkit which is a complete software development solution for
programming CUDA-enabled GPUs is utilized as the code design medium. Both
2D/3D compressible and incompressible Navier-Stokes equations for various test
cases are solved. These test cases are the advection of a vortical disturbance (2D
advection equations), temporal mixing layer (3D compressible equations) and flow
around a square cylinder (3D incompressible equations). Moreover, IBM is
incorporated into incompressible test case.
Detailed description of the numerical techniques and application processes for the
test cases are presented in the corresponding chapters of the thesis. Structure of the
5
thesis is organized as follows: In Chapter 2, the governing equations for the test cases
are given. Then, Chapter 3 may be referred to see a broader description of the applied
numerical methods. Chapter 4 gives the information about general purpose GPU
computing (GPGPU) by focusing mostly on CFD applications. GPU computational
architecture and some detailed information about the developed code are exhibited in
Chapter 5. Implementations and performance comparisons of them for the simulated
test problems are presented in Chapter 6. Finally, concluding remarks take place in
Chapter 7.
7
2. GOVERNING EQUATIONS
The governing equations solved for the three application problems are given in the
following sections. These are 2D advection equations, 3D compressible Navier-
Stokes equations and finally 3D incompressible Navier-Stokes equations.
2.1 2D Advection Equations and 3D Compressible Navier-Stokes Equations
2D advection equations are used for the first problem of advection of a vortical
disturbance.
� �� + � �� + � � �� = 0
(2.1) ���� + ���� + � ���� = 0 In these equations, and � are corresponding velocity components in x and y
directions respectively.
For the second problem, a temporal mixing layer, 3D Navier-Stokes equations in
conservation form are solved. The entire system of equations is:
∂U∂t + ∂F∂x + ∂G∂y + ∂H∂z = 0 (2.2)
Solution vector and flux vectors are:
� =�� �!
"" "�"#" $� + %&2 ()�
*�+ (2.3)
8
, =��� ��!
" " & + - − /00"� − /01"# − /02" $� + %&2 ( + - − 3 �4�� − /00 − �/01 − #/02)��
*��+
(2.4)
5 =�� �!
"�" � − /10"�& + - − /11"#� − /12" 6� + 78& 9 � + -� − 3:;:1 − /10 − �/11 − #/12)�*�+
(2.5)
< =�� �!
"#" # − /20"�# − /21"#& + - − /22" 6� + 78& 9# + -# − 3:;:2 − /20 − �/21 − #/22)�*�+ (2.6)
/=> = ? $� @��A +� A��@ − 23C@A � D��D( (2.7)
Here, , � and # represent Cartesian velocity components in x, y, z directions. ", -
and 4 denote density, static pressure and static temperature respectively, while ?, 3, �
and % denote molecular viscosity, thermal conductivity, internal energy and velocity
magnitude. /@A and C@A represent the viscous stress tensor and the Kronecker delta
respectively.
In application problems, periodic boundary conditions are applied in all directions
for both 2D advection equations and 3D Navier-Stokes equations. If F = 1 and
F = G0 are boundary point on the left and right boundary respectively, then the
relevant substitutions for points beyond the boundaries take the following form. H is
any flow variable defined as periodic on the flow domain.
HI = HJK and HLM = HJKLM for the left boundary.
HJKNM = HM and HJKN& = H& for the right boundary.
9
2.2 3D Incompressible Navier-Stokes Equations
For the simulation of third test case, flow around a square cylinder, incompressible
Navier-Stokes equations are applied. These equations are:
� �� + ���� + �#�O = 0 (2.8)
" P� �� + � �� + � � �� + # � �OQ = −�-�� + ? $�& ��& + �
& ��& + �& �O&(
" P���� + ���� + � ���� + # ���OQ = −�-�� + ? $�&���& + �
&���& + �&��O&(
" P�#�� + �#�� + � �#�� + # �#�O Q = −�-�O + ? $�&#��& + �
&#��& + �&#�O& (
(2.9)
The same nomenclature in the previous section is also valid for these equations.
Boundary conditions applied for these equations in y and z directions are periodic
similar to the conditions mentioned in the previous sub-section 2.1. For the x
direction, on the left boundary Dirichlet conditions for Cartesian velocity
components and a Neumann condition for pressure are defined in the following form.
= �R � = 0 # = 0 �-/�� = 0
For the right boundary, Neumann conditions for Cartesian velocity components and a
Dirichlet condition for pressure are defined in the following form.
� ��⁄ = 0 �� ��⁄ = 0 �# ��⁄ = 0 - = 0
11
3. NUMERICAL METHODS
3.1 Temporal Discritization: Runge-Kutta and Adams-Bashforth Method
A fourth-order low storage modified Runge-Kutta method [49] is implemented to
advance the solution from time step n to n+1 in the solution of advection equations
and compressible Navier-Stokes equations. It involves four sub-stages to reach time
step n+1. Formulation of the advancement from sub-stage m to sub-stage m+1 is
given by:
UVNM = UV + WV∆�YZ[V\] = 1, 2, 3, 4 (3.1)
Here YZ[\ is the discrete representation of the all terms other than the term including
time derivative. In addition, WV shows the cooefficients of the Runge-Kutta method.
These coefficients are given in Table 3.1.
Table 3.1 : Coefficients of Runge-Kutta scheme.
m 1 2 3 4
_` 14
13 12 1
First advantage of using Runge-Kutta scheme comes from its explicit nature. As a
result of that, moreover, it is also easy to program and not so demanding in terms of
memory requirements. Furthermore, as stated in [49], Runge-Kutta schemes possess
better stability criteria than comparable explicit schemes. However, as a negative
side of it, since the same derivative computations must be performed in all sub-stages
of the scheme, for example four times for the present scheme, it creates an extra
computational burden.
Additional computations in sub-stages that must be done for the Runge-Kutta scheme
makes it highly restrictive in the case of incompressible flow conditions, because
solving a Poisson equation is rather time consuming process. Thus, for the
incompressible flow application, another scheme is applied for time integration,
second-order Adams-Bashforth scheme [50]. This scheme is also explicit and easy to
12
program. This scheme is not a multi-stage scheme, but it is multi-point. Unlike
Runge-Kutta, it requires derivative computations only once for one time step.
Nonetheless, being multi-point causes a disadvantage. That is, it can not start the
computation by itself with the initial conditions, since it requires data from the point
prior to the current one. Another method is needed to get computation started.
Formulation related to this method for the time step n+1 may be seen below:
UJNM = UJ + ∆� a32 YZ[J\ − 12 YZ[JLM\b (3.2)
As it is seen in the formulation this method, unlike Runge-Kutta method, uses the
results of previous two time steps n-1 and n to get to the time step n+1.
3.2 High-Order Central Compact Finite Difference Scheme
A high-order central compact finite difference scheme is applied throughout this
study for spatial discretization of the governing equations given in chapter 2.
Compact finite difference schemes provide more accuracy than the standard finite
difference schemes. In order to achieve this, compact schemes mimic the behavior of
spectral methods which couple derivative terms over all grid nodes. In that way,
compact scheme couples the derivatives of the neighboring nodes to reach an
improved accuracy. Lele, in [1], introduced the compact finite difference schemes
and investigated the accuracy features of them. Starting from the Hermitian formula
[51] and using the Taylor series expansions for the terms, we can come up with a
general five-point formulation for the approximation of the first derivative:
cH@L&′ + dH@LM′ + H@′ + dH@NM′ + cH@N&′ = W H@Ne − H@Le6ℎ + h H@N& − H@L&4ℎ + i H@NM − H@LM2ℎ (3.3)
Here, Hand H′ can be any flow variable and its derivative, respectively. ℎ is the grid
spacing in i-direction. Expressions for the relations of d, c, i, handW can be found
by matching of the coefficients obtained by the substitution of the Taylor series
expansions. Ultimately, attained expressions between the coefficients may be seen in
Table 3.2.
13
Table 3.2 : Relations between the coefficients for the first iderivative.
Second order i + h + W = 1 + 2d + 2c
Fourth order i + 2&h + 3&W = 23!2! Zd + 2&c\ Sixth order i + 2nh + 3nW = 25!4! Zd + 2nc\ Eight order i + 2ph + 3pW = 27!6! Zd + 2pc\ Tenth order i + 2rh + 3rW = 29!8! Zd + 2rc\
Depending on the d and c values, this scheme gives a tridiagonal or a pentadiagonal
systems. In the case of periodic conditions for the dependent variable, these systems
take the forms of cyclic-tridiagonal and cyclic-pentadiagonal. Various schemes with
different order of accuracies for these two forms are presented in [1]. As mentioned
before, pentadiagonal schemes are constructed for the case of c ≠ 0. Generally
fourth-order three-parameter (d, c, W) group is:
i = 13 Z4 + 2d − 16c + 5W\,h = 13 Z−1 + 4d + 22c − 8W\ (3.4)
Schemes, having sixth-order accuracy, involve two parameters (d, c):
i = 16 Z9 + d − 20c\h = 115 Z−9 + 32d + 62c\W = 110 Z1 − 3d + 12c\ (3.5)
A one parameter, eight-order pentadiagonal scheme can be generated by applying
c = M&I Z−3 + 8d\ in (3.5). This eight-order group is:
c = 120 Z−3 + 8d\i = 16 Z12 − 7d\ h = 1150 Z−183 + 568d\W = 150 Z−4 + 9d\
(3.6)
14
Again by applying d = M& in (3.6) a tenth-order scheme is obtained:
d = 12 c = 120 i = 1712 h = 101150 W = 1100 (3.7)
Then, we come to the tridiagonal schemes in the case of c = 0. Moreover, if the
condition W = 0 is also applied, a group of one parameter, fourth-order tridiagonal
schemes is generated:
c = 0i = 23 Zd + 2\h = 13 Z4d − 1\W = 0 (3.8)
If d is taken as 0, then this scheme is transformed into well-recognized explicit
fourth-order central difference scheme. In a similar way, standard Pade scheme is
obtained in the case of = Mn . In addition to this and especially more importantly for
this study, if d is equal to Me then the leading order truncation error coefficient
becomes equal to zero and the scheme changes into a sixth-order scheme. Relevant
coefficients are:
d = 13 c = 0i = 149 h = 19 W = 0 (3.9)
Throughout this study, first order spatial derivatives in the governing equations are
approximated by applying this sixth-order compact central finite difference scheme.
The second order derivatives in the viscous fluxes of compressible Navier-Stokes
equations are also approximated by using (3.9) two times successively.
Compact schemes for the second derivative can be formed in a similar way. Again
we can start with an expression including the second derivatives and function values
of neighboring points. This relation form is also similar to the one for the first
derivatives:
cH@L&′′ + dH@LM′′ + H@′′ + dH@NM′′ + cH@N&′′= W H@Ne − 2H@ + H@Le9ℎ& + h H@N& − 2H@ + H@L&4ℎ& + i H@NM − 2H@ + H@LMℎ& (3.10)
In the same way, obtained constraints for d, c, i, handW by matching the Taylor
series coefficients are presented Table 3.3.
15
Table 3.3 : Relations between the coefficients for the isecond derivative.
Second order i + h + W = 1 + 2d + 2c
Fourth order i + 2&h + 3&W = 4!2! Zd + 2&c\ Sixth order i + 2nh + 3nW = 6!4! Zd + 2nc\ Eight order i + 2ph + 3pW = 8!6! Zd + 2pc\ Tenth order i + 2rh + 3rW = 10!8! Zd + 2rc\
If c and W are taken as equal to 0, a one parameter group of fourth-order schemes is
attained:
c = 0i = 43 Z−d + 1\h = 13 Z10d − 1\W = 0 (3.11)
Regarding this group, if d = 0 again this scheme is transformed into well-recognized
fourth-order central difference scheme. In addition, a sixth-order tridiagonal compact
scheme is obtained for d = &MM. Coefficients of this scheme are seen below:
d = 211 c = 0i = 1211 h = 311 W = 0 (3.12)
In this study, second-order spatial derivatives of incompressible Navier-Stokes
equations are approximated by applying this sixth-order compact scheme. Some
other combinations of d, c, i, h, W and corresponding schemes for both first and
second derivativesmay be found in literature, especially in [1].
Implemented fourier analysises to investigate the resolution characteristics of the
various finite difference schemes may be found in the literature, for example in [1,
51, 52]. Below is an example of such an analysis.
Therefore, as mentioned, a Fourier error analysis may be performed in order to
compare the resolution characteristics of various schemes. For this purpose, an
16
arbitrary periodic function in the form of �@D0 which is decomposed of Fourier
components, can be utilized. Here, k and i are the wave number and imaginary
number respectively. Then we can take;
HZ�\ = H = �@D0 (3.13)
and its derivative is:
P�H��Q = F3�@D0 = F3H (3.14)
If we think of a finite difference operator C0 then,
C0HA = F3∗�@D0w = F3∗HA (3.15)
and
�A = x∆� (3.16)
Therefore, if we take second-order central finite difference scheme below,
P�H��QA =HANM − HALM2∆� (3.17)
inserting the derivative expressions,
F3∗HA = �@DZANM\∆0 − �@DZALM\∆02∆�
(3.18)
= �@D∆0 − �L@D∆02∆� HA remembering the Euler formula for the complex exponential function,
�@D∆0 = cosZ3∆�\ + F|FGZ3∆�\ (3.19)
We can write the expression (3.18) as,
F3∗ = cosZ3∆�\ + F|FGZ3∆�\ − cosZ3∆�\ + F|FGZ3∆�\2∆� (3.20)
17
and we can write the modified wave number expression as below.
3∗∆� = |FGZ3∆�\ (3.21)
Similarly, if we take the fourth-order central finite difference scheme as second
scheme to be analyzed,
P�H��QA =−HAN& + 8HANM − 8HALM + HAL&12∆� (3.22)
F3∗HA = −�@&D∆0 + 8�@D∆0 − 8�L@D∆0 + �L@&D∆012∆� HA (3.23)
again by using Euler formula, we can come up with an expression for the modified
wave number:
3∗∆� = −|FGZ23∆�\ + 8|FGZ3∆�\6 (3.24)
This Fourier error analysis can be performed for compact schemes as well. In this
case, following expression should be defined for the implementation on the compact
schemes.
C0HANJ = F3∗�@JD∆0�@D0w = F3∗�@JD∆0HA (3.25)
If we take the sixth-order tridiagonal central compact finite difference scheme as an
example, after the some arrangements scheme is slightly changed into the following
form,
HALM} + 3HA} + HANM} = 143 $HANM − HALM2∆� ( + 13$HAN& − HAL&4∆� ( (3.26)
inserting the corresponding expressions into the scheme,
F3∗HA~�L@D∆0 + 3 + �@D∆0�= �143 $�
@D∆0 − �L@D∆02∆� ( + 13$�@&D∆0 − �L@&D∆04∆� (� HA (3.27)
18
With the aid of the Euler formula again, we can attain the desired modified wave
number expression,
3∗∆� = 28|FGZ3∆�\ + |FGZ23∆�\18 + 12W�|Z3∆�\ (3.28)
In consideration of the obtained modified wave number expression for the different
central finite difference schemes, it is seen that modified wave number has only real
component when it comes to central difference schemes. Generally modified wave
number is a complex number. Since exact wave number is actually real, modified
wave number specifies the error characteristic that differencing scheme may exhibit
inherently. Therefore, difference between the real parts of the exact and modified
wave numbers gives the dispersion errors resulting from odd derivatives in the
truncation error. In addition, difference between the imaginary parts of the exact
(actually zero) and the modified wave number shows the dissipation errors resulting
from the even derivatives in the truncation error. Figure 3.1 presents a modified wave
number plot for the six different central schemes. Among these schemes, three of
them are already analyzed above and corresponding modified wave number
expressions are obtained. Modified wave number expressions of remaining three
schemes are presented in Appendix A.
Figure 3.1 : Resolution characteristics of central schemes for first derivative.
0
0.5
1
1.5
2
2.5
3
0 0.5 1 1.5 2 2.5 3
k*∆
x
k∆x
exact
2nd central
4th central
4th pade
6th tri. comp.
8th pen. comp.
10th pen. comp.
19
Table 3.4 also gives another aspects of resolution characteristics presented in Figure
3.1. The given quantities are also for the same six central schemes indicated in the
modified wave number plot. The first quantity is the 3�∗ which is the maximum wave
number resolved accurately enough by a scheme. The limit specifiying the 3�∗can be
taken as |3∗∆� − 3∆�| < 0.01. That means wave numbers greater than 3�∗ can not be
resolved accurately. Second quantity shows the number of points per wavelength
(PPW) in order to be successfull in resolving a certain mode. It is defined as,
��� = 2�3�∗∆� (3.29)
Table 3.4 : Resolution indications of various schemes.
Scheme ��∗∆� PPW
2nd central 0.35 18
4th central 0.75 8.4
4th Pade 1.05 6
6th tri. comp. 1.5 4.2
8th pen. comp. 1.75 3.6
10th pen. comp. 2 3.1
With these PPW values, for instance, we can say that second-order scheme needs
more than four times as many points in order to obtain the accuracy level of the
sixth-order compact scheme.
Fourier analysis of the schemes for second derivative may be performed similar to
the analysis for the first derivative. Again we can define a function of the form;
HZ�\ = H = �@D0 (3.30)
and its second derivative is:
20
$�&H��&( = −3&�@D0 = −3&H (3.31)
If we think of a finite difference operator C00 then,
C00HA = −3∗&�@D0w = −3∗&HA (3.32)
and
�A = x∆� (3.33)
So, if we consider the same schemes as in the analysis for the first derivative, we can
start with the second-order central finite difference scheme shown below;
$�&H��&(A =HANM − 2HA + HALM∆�& (3.34)
by using above definition,
−3∗& = �@D∆0 − 2 + �L@D∆0∆�& (3.35)
With the Euler formula for the complex exponential function, modified value can be
obtained as:
3∗&∆�& = 2 − 2W�|Z3∆�\ (3.36)
Similarly, we can make the analysis for the fourth-order central scheme as follows:
$�&H��&(A =−HAN& + 16HANM − 30HA + 16HALM − HAL&12∆�& (3.37)
−3∗& = −�@&D∆0 + 16�@D∆0 − 30 + 16�L@D∆0 − �L@&D∆012∆�& (3.38)
3∗&∆�& = 15 + W�|Z23∆�\ − 16W�|Z3∆�\6 (3.39)
And for the sixth-order compact finite difference scheme:
21
HALM}} + 112 HA}} + HANM}} = 122 $HANM − 2HA + HALM∆�& ( + 32$HAN& − 2HA + HAL&4∆�& ( (3.40)
−3∗& P�L@D∆0 + 112 + �@D∆0Q= 122 $�
@D∆0 − 2 + �L@D∆0∆�& ( + 32$�@&D∆0 − 2 + �L@&D∆04∆�& (
(3.41)
3∗&∆�& = 48 − 48W�|Z3∆�\ + 3 − 3W�|Z23∆�\22 + 8W�|Z3∆�\ (3.42)
As in the analysis for the first derivative, results of the Fourier analyses for the
fourth-order Pade, eight-order pentadiagonal compact and tenth-order pentadiagonal
compact schemes are presented in Appendix A. Therefore, obtained results are
plotted in Figure 3.2.
Figure 3.2 : Resolution characteristics of central schemes for second derivative.
0
1
2
3
4
5
6
7
8
9
10
0 0.5 1 1.5 2 2.5 3
k*
2∆
x2
k∆x
exact
2nd central
4th central
4th Pade
6th tri. comp.
8th pen. comp.
10th pen. comp.
22
So from the comparisons between some standard finite difference schemes and
compact schemes presented in Figure 3.1 and Figure 3.2, it may be seen that second
and fourth-order central schemes exhibit a considerable discrepancy from exact
differentiation on most of the wavenumber spectrum. Standard Pade scheme does
much better than the other standard schemes. Thus, all compact schemes give better
results than the standard ones. Moreover, as expected eight and tenth-order compact
schemes have better resolution characteristics than the sixth-order one. As mentioned
before, sixth-order compact central scheme is applied as the spatial approximation
method for the derivatives. One reason for that, as stated in [1], the improved
resolution properties of the schemes lead to the possibility of increased aliasing
errors, and in turn, this may require an extra work to suppress these errors. In
addition, using a scheme with the order of accuracy higher than six means to struggle
with coefficient matrices with larger bandwith and larger stencils. In this respect,
some comparisons between compact and standard finite difference schemes in terms
of computational efficiency and accuracy may be found in literature [9, 13, 52, 53].
As a result, sixth-order compact scheme seems as a compromise between opposing
factors. This scheme may be seen below for the first and second derivatives.
13 H@LM} + H@} + 13H@NM} = 19 PH@N& − H@L&4ℎ Q + 149 PH@NM − H@LM2ℎ Q (3.43)
211 H@LM}} + H@}} + 211H@NM}} = 311 PH@N& − 2H@ + H@L&4ℎ& Q + 1211 PH@NM − 2H@ + H@LMℎ& Q (3.44)
For the non-periodic boundaries, one-sided schemes on the boundary are applied,
since the stencil of inner compact scheme extend out of the boundary. In the case that
high-order compact boundary and near-boundary schemes are used with high-order
compact inner schemes, apparence of numerical instabilities is higly probable [54].
Therefore, orders of the used boundary and near boundary schemes are lower than
that of the interior scheme. In addition to these, tridiagonal form of the interior
scheme is preserved by utilizing these schemes. As a result, third-order one-sided
compact scheme and fourth-order central compact scheme are applied for the
boundary point F = 1, and near-boundary point F = 2 respectively. The mentioned
schemes are seen below for the first and second derivatives:
23
1st derivative, F = 1 (3rd order) HM} + 2H&} = 12ℎ Z−5HM + 4H& + He\ (3.45)
1st derivative, F = 2 (4th order) 14 HM} + H&} + 14He} = 34ℎ ZHe − HM\ (3.46)
2nd derivative, F = 1 (3rd order) HM}} + 11H&}} = 1ℎ& Z13HM − 27H& + 15He− Hn\ (3.47)
2nd derivative, F = 2 (4th order) 110 HM}} + H&}} + 110 He}} = 65ℎ& ZHe − 2H& + HM\ (3.48)
If we say this is the left boundary, similar formulations are implemented for the right
boundary at F = � and the near-boundary point at F = � − 1.
3.3 High-Order Low-Pass Filtering Scheme
As mentioned before, high-order compact central schemes are non-dissipative. When
non-dissipative central schemes are used to simulate conservative form of nonlinear
terms, such as the convective terms in the compressible Navier–Stokes equations,
nonlinear instabilities due to aliasing errors can potentially arise. To eliminate
resulting spurious high frequency oscillations from the solution and to ensure
numerical stability, a low-pass spatial filtering procedure [55] is incorporated within
the compact difference scheme. A general expression for the tridiagonal implicit
filters can be given as follows:
d�∅�@LM + ∅@ + d�∅�@NM = � iV2 Z∅@NV + ∅@LV\�V�I
(3.49)
Here the symbol ∅ represents a component of the solution vector and the symbol ∅�
denotes the filtered value of that component. The above compact filter gives a 2Mth-
order accuracy for the 2M+1 point stencil. In this study, tenth-order filtering (M=5)
is applied for attaining the stable solutions. The free parameter d� is chosen such that
−0.5 < d� ≤ 0.5, while to suppress a wider range of high frequency oscillations, a
smaller value of d� can be chosen [14]. If d� is taken as 0, then implicit filtering
formula is transformed into explicit one. iV coefficients in the filtering expression
24
(3.49) are dependent on the free parameter d�. These coefficients are presented in
Table 3.5 for various schemes with the different orders of accuracy.
Table 3.5 : Filtering coefficients for various schemes.
2nd order 4th order 6th order 8th order 10th order
iI 1 + 2d�2 5 + 6d�8
11 + 10d�16 93 + 70d�128
193 + 126d�256
iM 1 + 2d�2 1 + 2d�2
15 + 34d�32 7 + 18d�16
105 + 302d�256
i& 0 −1 + 2d�8
−3 + 6d�16 −7 + 14d�32
15~−1 + 2d��64
ie 0 0 1 − 2d�32
1 − 2d�16 45~1 − 2d��512
in 0 0 0 −1 + 2d�128
5~−1 + 2d��256
i� 0 0 0 0 ~1 − 2d��512
If we make a wave number analysis of this implicit filtering expression, we can come
up with a spectral transfer function ,Z�\ in the form of:
,Z�\ = ∑ iVcosZ]�\�V�I1 + 2d�cosZ�\ (3.50)
This transfer function expression gives the dissipation characteristics of different
schemes. Figure 3.3 presents the mentioned characteristics for various central
filtering schemes in the case of d� = 0.4. The exact expression in this figure
corresponds to the unfiltered values. As seen in the plot, higher-order filters exhibit
less dissipation at lower wave numbers. Figure 3.4 also shows the characteristics of
the tenth-order filtering with differing d�. It may be seen spectral-like behaviour of
filter for d� = 0.49.
25
Figure 3.3 : Dissipation characteristics for different filters (�� = �. ).
Figure 3.4 : Dissipation characteristics of tenth-order filtering.
0
0.2
0.4
0.6
0.8
1
1.2
0 0.5 1 1.5 2 2.5 3 3.5
Tra
nsfe
r fu
nctio
n
Wave number
exact
2nd
4th
6th
8th
10th
0
0.2
0.4
0.6
0.8
1
1.2
0 0.5 1 1.5 2 2.5 3 3.5
tran
sfer
func
tion
wave number
exact
αf= -0.2
αf= 0
αf= 0.2
αf= 0.4
αf= 0.49
26
3.4 Immersed Boundary Method
As noted before, IBM is a very fruitful method that is not restricted by the
complexity and extra computational burden it brings about. In this context, Figure
3.5 gives a representative picture of a solid body immersed in a Cartesian mesh.
Figure 3.5 : An arbitrary body immersed in a Cartesian mesh.
This figure indicates a non-conformal Cartesian mesh, moreover f and b shows the
fluid and body sections of the domain respectively. In this method, there is also a
surface grid covering and identifying the immersed boundary, but the difference is
that the Cartesian grid is generated in a way as if there is no such a surface grid. As a
result of this, cartesian grid runs through the solid body. Since there is not a body
conformal grid, near the immersed boundary some modification to the governing
equations are needed in order to implemet boundary conditions properly. In this
respect, there are two general approaches in IBM related to imposition of boundary
conditions [15]. First one is continuous forcing approach in which forcing takes place
in the continuous equations before any discretization. And the second one is direct
forcing approach in which forcing is implemented after the discretization.
3.4.1 Continuous Forcing Approach
As mentioned before in the introduction, immersed boundary method was first
implemented by Peskin [17] to simulate blood flow interacting with the heart.
Method used by Peskin was applied to elastic boundaries like a beating heart. A
27
group of elastic fibers specifies the immersed boundary and the position of the
boundary is determined by the Lagrangian points moving with the fluid flow. The
equation seen below governs the position of the nth Lagrangian point;
��J�� = �Z�G, �\ (3.51)
In addition, while elastic fibers moves, they actually influence the surrounding fluid
region by imposing a stress to the fluid nodes around. This effect is represented by a
local forcing term in the momentum equation. This forcing term is in the form of:
�VZ�, �\ =��GZ�\CZ|� − �G|\G
(3.52)
Here, � and C denotes the stress and the Dirac delta function respectively. Because
of the fact that nodes of the Cartesian mesh mostly does not coincide with the
position of the fibers, actually this forcing is distributed over the mesh nodes near the
Lagrangian point by utilizing the momentum equations of these nodes. Therefore,
use of the smoother functions instead of the sharp Dirac delta function is prefered
and applications of different distribution functions may be seen in the literature [56-
58]. Implementation of the IBM in this way is suitable for elastic boundaries but
when it comes to rigid bodies, some problems are arising. If a boundary approaches
to the rigid limit, then the above method starts to lose its well-posedness gradually.
In order to overcome this issue some researchers [22, 56] resort to a spring analogy.
In this approach, structure is defined as attached to an equilibrium point via a spring
that has a restoring force shown below;
�JZ�\ = −�Z�J −�J� Z�\\ (3.53)
here � and �J� are spring constant and the equilibrium point for the Lagrangian point
n. The higher � values mean more accurate definition of boundary conditions, but
this also leads to a stiff system and severe stability limitations [59]. In this context,
another approach comes from Goldstein et al. [23] and Saiki and Bringen [24]. Fluid
flow around the immersed boundary is affected by the body through the following
force term;
28
�Z��, �\ = d��Z��, /\�/ + c�Z��, �\�
I (3.54)
Here, �| shows a boundary point. d and c in this expression are negative constants and
can be used to impose the boundary conditions properly. This force also may be
named as feedback forcing, because it behaves in a restoring way according to the
difference between the desired boundary velocity level and the actual one. Moreover,
one of the recent research that is an application of the continuous forcing IBM to the
GPU may be found in [60].
3.4.2 Direct Forcing Approach
Since the integration of the Navier-Stokes equations analytically is not possible in
general, specifiying a forcing function based on this integration also is not a realistic
approach. Therefore, there should be some simplifications on the forcing terms
mentioned before. therefore, in order to circumvent this issue, some researchers [25,
61] suggest another forcing approach that may be called as direct forcing. In this
approach, forcing term is taken directly from the numerical solution. Thus, there is
not any parameter defined by the user, so there is not a stability limitations related to
these parameters also.
In this study, direct forcing IBM is utilized in the solution of incompressible Navier-
Stokes equations. Essence of direct forcing is imposition of the boundary velocity
% on the solution of the velocity for the solid boundary. This imposition is applied
via a force term which is added to the governing equations. If we think of the x-
component of the incompressible Navier-Stokes equations:
� �� = −P � �� + � � �� + # � �OQ − 1" �-�� + ¡ $�& ��& + �
& ��& + �& �O&( + ,0 (3.55)
Here ,0is the forcing term for the IBM and inserted to the equation in order to ensure
the proper boundary velocity value on the surface of solid body. If we simply
discretize this equation in time:
JNM − J∆� = �<¢ + ,0 (3.56)
29
RHS in this equation includes convection, diffusion and pressure terms. Then from
this relation, the forcing value ensuring the velocity boundary condition on the wall
can be written easily as:
,0 = £ $−�<¢ + JNM − J∆� ( (3.57)
In this expression, £ = 1 in the solid body region and £ = 0 everywhere else [53].
Therefore, with the use of forcing term ,0 velocity boundary condition can be
imposed directly on the solid surface. As a result, boundary conditions on the wall
are satisfied at every time step. This is the general owerview of the used IBM. As
mentioned before, time integration is realized with the use of a second-order Adams-
Bashfort method with a fractional step approach for the test case involving IBM. We
can combine the time integration with the governing equations to explain the IBM as
used in this study.
If we write incompressible Navier-stokes equations in vector form:
∇ ∙ ¦ = 0
�¦�� + ¦ ∙ ∇¦ = −1"∇- + ¡∇&¦ + � (3.58)
Here � is the direct forcing vector that is described before. With the use of this
forcing term, required boundary conditions on the solid surfaces are imposed
directly. If both convection and diffusion terms are represented by §, like:
§ = −Z¦ ∙ ∇¦\ + ¡∇&¦ (3.59)
Then, the time advancement of the momentum equation in (3.58) may be given as:
¦∗ − ¦J∆� = 32§J − 12§JLM − 1"∇-J + �JNM (3.60)
¦∗∗ − ¦∗∆� = 1" ∇-J (3.61)
¦JNM − ¦∗∗∆� = −1"∇-JNM (3.62)
30
In this approach, expression for the direct forcing term related to IBM may be given
by:
�JNM = £ $−32§J + 12§JLM + 1"∇-J + ¦ JNM − ¦J∆� ( (3.63)
As mentioned before, £ is defined as 1 in the solid body region and 0 everywhere
else. With the insertion of this term into (3.60), desired velocity boundary condition
in the region prescribed by £ is satisfied in the first step of fractional step method.
Normally in a standard fractional step method, the incompressibility condition
∇ ∙ ¦JNM = 0, may be satisfied with the solution of a Poisson equation in the form
of:
∇ ∙ ∇-JNM = "∆� ∇ ∙ ¦∗∗ (3.64)
When IBM is incorpareted in the computational method, then this equation is
transformed into:
∇ ∙ ∇-JNM = "∆� ∇ ∙ ¨Z1 − £\¦∗∗© (3.65)
When £ = 0 then the standard Poisson equation (3.64) is obtained. However, in the
solid body region, £ = 1, (3.65) gives the Laplace equation. This Poisson equation is
solved with the use of conjugate gradient method.
Another important point in this direct forcing IBM application is to create an internal
flow maintaining the desirable boundary condition, no-slip in this case, at the
cylinder surface. In order to reach this goal, a reverse flow similar to mirror
conditions is imposed to the proper grid nodes inside the cylinder. This approach can
be seen in Figure 3.6.
31
Figure 3.6 : Vicinity of the immersed boundary.
In this figure, F is the any grid direction aligned with the ª (normal to the surface S).
U and P are any velocity components and pressure respectively. As it also may be
seen in Figure 3.6, velocity and pressure values of the first two nodes inside the solid
body are prescribed directly to preserve the velocity (no-slip) and pressure (∂p/∂n=0)
boundary conditions on the surface S. In this approach, first two nodes are taken into
consideration, because stencil width of the used sixth-order compact central finite
difference scheme is also two. If we think of nodes inside the body other than the
first two nodes, direct forcing term is applied for them too. In the context of internal
treatment of the body, there are several ways to implement [26]. However it shold be
noted that outside flow is not principally dependent on the inside flow. One of the
possibilities, also applied in this study, is to impose the direct forcing inside the body
too . Another way is not to impose anything and to leave the interior of the body free
to develop. This leads to a different flow conditions from that of the previous
approach, but exterior flow is not affected from this. And the third way is an
approach in which velocities next to the boundaries are reversed as as used in this
study too. This method is applied to avoid from spurious oscillations near the
boundary [25] when using high-order compact scheme.
3.5 Conjugate Gradient Method
Expression (3.65) constitutes a sparse system that is too large to be solved with direct
methods. Therefore conjugate gradient method which is one of the iterative methods
used for the solution of symmetric and positive difinite systems is utilized [62].
Conjugate gradient iterative algorithm for the solution of system §� = « may be
seen in Figure 3.7.
32
¬I = « − §�I �I is an initial approximate solution vector and ¬I is a residual.
�I = ¬I � is an auxilary variable.
3 = 0 3 is an iteration step indicator.
repeat
D = §�D D is 3�® iteration residual.
dD = ¬D;¬D�D;D dDis a correction factor.
�DNM = �D + dD�� �DNMis the solution at Z3 + 1\�® iteration.
¬DNM = ¬D − dDD ¬DNM is the residual at Z3 + 1\�® iteration. If ¬DNM is small enough, then exit loop and the solution is �DNM.
cD = ¬DNM; ¬DNM¬D;¬D cD is a correction factor.
�DNM = ¬DNM + cD�� �DNM is the auxilary variable at Z3 + 1\�® iteration.
3 = 3 + 1
end
Figure 3.7 : Conjugate gradient method algorithm.
Moreover, as it is seen in the conjugate gradient algorithm, one important step is the
sparse-matrix vector multiplication. In this context, for the application of conjugate
gradient method on the GPU several different storage methods for the sparse matrix
are implemented. In general there are numerous sparse matrix formats that exhibit
distinct characteristics in terms of handling the elements of matrix, storage
requirements and computational features. Looking at some well-known sparse matrix
storage formats may clarify these distinction:
Coordinate Format (COO): This storage structure is quite simple and
straightforward. In other words, this format is a general one in which needed storage
space is always proportional to the nonzero elements for any sparse matrix. This
format uses three arrays which are data, row and column. Each one holds nonzero
33
values, row indices and column indices respectively. That is, row and column indices
are stored explicitly. Figure 3.8 shows this representation for sparse matrix A.
Figure 3.8 : COO format for any sparse matrix A.
Compressed Sparse Row Format (CSR): This is another well-known format for
sparse matrices. It also includes three arrays; data and column arrays store nonzero
values and column indices respectively. And a third array, rowptr, includes a row
pointer showing the beginning of each row in the data and column arrays. If sparse
matrix A has M row, then rowptr array will have M+1 elements and this one
additional element holds the total number of nonzero values in A. Actually, CSR
format is a natural subset of COO format, so transformation between these two
formats is also rather uncomplicated. Figure 3.9 also presents the CSR format.
Figure 3.9 : CSR format for any sparse matrix A.
34
Other than these, there are different storage formats for the sparse matrices differing
according to matrix structure, computational structure and so on.Performance of
sparse matrix-vector multiplication in terms of used storage formats is analyzed in
the literature in several studies [63, 64].
With a different format that may be used instaed of CSR, one can reach higher
performance. For instance, ELLPACK format[63] gives more efficiency than CSR
format because of structure of sparse matrix in this study. In this format, an
important factor is the maximum number of nonzero elements in a row (say K). As a
result nonzero elements of a sparse matrix M by N are stored into an M x K data
array.
Figure 3.10 : Format of ELL for a general sparse matrix A.
Rows not having K nonzero elements are padded with zeros. Likewise, column
indices are hold in another array with similar padding approach. In addition to this, if
the data storing is made in the form of column- major order, then nearly full memory
coalescing is provided on the GPU that gives a high performance in turn. Moreover,
another vector containing the exact number of nonzero elements of each row is also
generated. Therefore, unnecessary operations of padded zeros are prevented with the
incorporation of this additional vector into the algorithm. This format is represented
in Figure 3.10. This figure shows the arrays of data, column indices and nonzero
elements per row for a general sparse matrix A. As a result, considerably higher
35
performance is attained in the operation of sparse matrix-vector multiplication. A
comparison between the CSR version of a GPU library (CUSPARSE) and a
constituted custom kernel utilizing ELLPACK format is given in the section
involving relevant test case .
37
4. GENERAL-PURPOSE GPU (GPGPU) COMPUTING
Until quite recently, microprocessors with a single central processing unit (CPU) has
provided with a constant increase in performance and reduction in cost. During this
period, these microprocessors made the GFLOPS (billion floating-point operations in
second) term ordinary even for the individual desktop computers. Therefore, any
advancement in hardware resulted in corresponding performance gain in almost
every computer applications. This was a native consequence of the trend. However,
at the beginning of 2000s, this trend lost its effect with the appearance of some issues
like energy-consumption and heat generation of the microprocessors. These factors
limited the increase in performance. As a result of this, microprocessor
manufacturers headed towards the chips including multiple processor units (known
as cores). This new approach created a snowball effect in terms of developed
software and applications. Priorly, sequential codes constituted the most of the
applications running in the computers, because there was no need for parallel ones.
This has changed drastically with the appearance of chips with multiple cores. Thus,
instead of sequential programs that may run only on a single core, parallel
applications started heavily to emerge for benefiting the provided performance
improvements in the multi-core structure of the processors.
Since the first emergence of the multi-core processors, the semiconductor industry
has settled on two main trajectories for designing microprocessors [65]. The multi-
core trajectory mainly tries to comprise between execution speed of the sequential
programs and multiple core structure. In contrast, the many-core trajectory focuses
more on the execution throughput of parallel applications [66]. The many-core
structure includes a large number of much smaller cores. For almost a decade, many-
core processors, especially the GPUs, have led the race of floating-point performance
in terms of the raw speed that the execution resources can potentially support. For
example, peak performance of NVIDIA M2050 GPU reached to 1030 and 515
GFLOPS for single and double precision floating-point operations respectively. As a
result, these performance differences unsurprisingly attracted many researchers from
different fields who put in effort to reassess computationally intensive parts of their
38
codes in order to make use of the GPU performance. In the current situation, GPU
computing is applied in quite different fields of science and engineering. To set an
example, GPUs are used in bio-informatics and life sciences [67, 68], in
computational electrodynamics [69], in data mining [70, 71], in medical imaging [72,
73], in molecular dynamics [74, 75] and in computational fluid dynamics.
4.1 GPU Computing in CFD
As mentioned before, CFD is also one of the various research fields that can benefit
from increasing performance and opportunities of GPUs. Memory handling is an
important factor affecting performance of the GPU applications considerably. An
implementation designed to run on the GPU can use various memory types. Global
memory that is much larger than shared, texture and constant memories, has also
high access latency. Conversely, the latter ones are very limited in size. Therefore, to
compare different GPU applications, without taking their features affecting memory
accesses into consideration, may not be so fruitful. These features may be grid type,
variable storage system, discretization method and solution scheme. For instance
used grid, structured or unstructured, specifies the data structure and ultimately
performance increase. If the grid is structured, then memory access is also structured
that means better speed-up values.
It is not surprising that early CFD simulations were generally related to gaming
applications. One of the first CFD implementations, relevant to a smoke simulation,
is conducted in [76] by M. Harris. In that study, two dimensional incompressible
Navier-Stokes equations are solved by using second-order central finite difference
schemes and for the solution of Poisson equation, Jacobi iteration is used because of
its simplicity and easy implementation. In this respect, another work can be seen in
[77] involving solution of three dimensional incompressible Euler equations. Again,
one of the first applications dealing with 2D/3D compressible Euler equations is
realized by Hagen et al. [36] and some 25x and 14x speed-ups compared to the CPU
code is attained for two and three dimensional simulations respectively. Other
performance comparisons between GPU and CPU are made for Euler equations in
[37, 38]. 29x and 16x speed-ups over the CPU codes are reported for two and three
dimensional solutions respectively. In another interesting work, a hypersonic flow
simulation is carried out on the GPU [39] and speed-ups of 40x and 20x are obtained
39
for the simple and complex geometries respectively. Issue of limited precision ( by
that time ) is analyzed in [78] by using a mixed precision iterative refinement
technique. An early implementation of unstructured grid for three dimensional
compressible Euler equations is presented in [40] with the use of cell-centered finite
volume technique. In another unstructured grid study similar to [40] 2D/3D Navier-
Stokes equations are solved by using vertex-centered finite volume technique [47].
Both of these studies carry out single, mixed and double precision calculations for
the purpose of comparisons. Apart from these, there are various CFD studies in the
literature [40-42, 44, 46, 48, 79-81]. The mentioned studies are generally lower-order
accurate in space. Though, there are few studies involving high-order accurate spatial
schemes in the literature [82, 83]. However, to the best of my knowledge, high-order
compact finite difference scheme is not applied to GPU for the simulation of any
fluid flow problem before the present study.
41
5. GPU ARCHITECTURE AND CODE DESIGN
5.1 Tesla C1060 and CUDA
Within the scope of this study, a Tesla C1060 GPU is used as a compute device. The
mentioned GPU may be seen in Figure 5.1.
Figure 5.1 : A Tesla C1060 GPU.
It has 30 streaming multi-processors, each containing 8 scalar processors, clocked at
1.3 GHz. It also has a 4 GB GDDR3 global memory at 102 GB/s. General
specifications of this device are presented in Table 5.1.
Table 5.1: General specifications of Tesla C1060 [84].
Number of processor cores 240
Processor core clock 1.296 GHz
Global Memory (GDDR3) 4 GB
Internal bandwitdth 102 GB/s
Peak performance (single precision) 933 GFLOPs/s
Peak performance (double precision) 78 GFLOPs/s
Typical power requirement 160 W
Peak power requirement 200 W
42
In addition, each multi processor has a 16 KB shared-memory that provides an
ability for the threads in the same block to communicate and share data. Although it
is limited in size, it has much lower latency compared to global memory. Figure 5.2
shows the general structure of a Tesla C1060 including 30 multi-processors and 240
cores.
Figure 5.2 : General structure of Tesla C1060.
The CUDA toolkit which is a complete software development solution for
programming CUDA-enabled GPUs is utilized as the code design medium. The GPU
is viewed as a compute device capable of executing a very high number of threads in
parallel and it operates as a co-processor to the main CPU which is designated as
host. Data-parallel, compute-intensive portions of applications running on the host
are transferred to the device (GPU) by using a function (kernel) that is executed on
the device as many different threads.
These threads are organized into thread blocks and thread blocks, together, constitute
a grid which may be seen as a computational structure on the GPU. Generally
speaking, a thread block is a batch of threads that can cooperate together by
efficiently sharing data and each thread is identified by its thread ID that indicates
the threads place within a block. Moreover, an application can specify the blocks as
3D arrays and identify each thread using a three-component index. This general
structure is presented in Figure 5.3. In this figure, above is a representation of a 2D
computational grid structure that executes a kernel. Below depicts the structure of a
chosen 3D thread-block (block(1,1)). The layout of a block is specified in a kernel
call to the device by three integers defining the extensions in X, Y and Z that
symbolize the dimensions for the CUDA grid structure. It should be also noted that
one CUDA block can currently contain 512 threads at maximum. However, blocks
that execute the same kernel can be batched together into a grid of blocks, so the total
43
number of threads that can be launched in a single kernel invocation is much larger.
Similar to the thread ID, each block is identified by its block ID that shows the block
place within a grid.
Figure 5.3 : A representation of CUDA structure.
5.2 Code Design
In this study, computational structure on the GPU contains 1D grid which is
composed of 1D blocks as also explained in [85]. The threads in the 1D blocks are
ordered along the X-axis and the blocks in the 1D grid are ordered along the Y-axis.
A 2D computation structure is established with the proper arrangements of the
threads and the blocks; as a result, 3D flow domain is represented by a 2D
computation structure with the help of indexing.
44
Figure 5.4 : Representation of the CUDA grid.
Every finite difference node in the computational mesh is assigned to a thread, so all
the calculations on a node are performed by the same thread. In Figure 5.4, “N”
shows the total node number in one direction; thus, the CUDA computation structure
contains N2 blocks along the Y-axis. In this way, we have N3 threads in total
corresponding to N3 finite difference nodes. Therefore, the CUDA coordinates X and
Y coincide with the physical coordinates x and y, whereas the physical z-coordinate
is represented by the N block slices. In other words, the first N blocks (from the
block “0” to the block “N-1”) represent the first physical xy-plane and the second N
blocks (from the block “N” to the block “2N-1”) represent the second physical xy-
plane and so on. This approach is preferred in order to constitute a general code
structure, since the total thread number in a block is limited to 512 and only 64
threads can take place in the Z-axis of the block.
Figure 5.5 : Computation structure for a 3D flow domain.
45
A general computation grid for a 3D flow domain generated from the 2D CUDA grid
can be visualized as in Figure 5.5. Here “blockDim.x” is the number of threads along
the X-axis in the block. Similarly, “blockIdx.x” and “blockIdx.y” are showing places
of the blocks in the X and Y-axes within the grid, respectively. “threadIdx.x” also
shows the place of the thread in X-axis within the block. These are all built-in CUDA
variables. In addition to these variables, “Thread.idx”, “Thread.idy” and
“Thread.idz” denote the global places of the threads on the X-, Y- and Z-axis
respectively.
Computation of derivatives is the most time consuming portion of the solution
process; therefore, care must be taken to ensure the most efficient implementation.
As mentioned previously, a sixth-order compact finite difference scheme is used for
calculations and this scheme computes derivatives implicitly line by line in each
direction. For a uniform N3 mesh with periodic boundary conditions, solution of an N
x N cyclic tridiagonal linear system with 3N2 different right-hand-side vectors is
necessary to obtain the derivatives of one variable. The coefficient matrix of this
system remains constant throughout the solution process. Although there are some
successful applications, based on the cyclic reduction method, performed previously
for the solution of tridiagonal systems on GPUs [86, 87], it is more effective to invert
and store the relatively small cyclic tridiagonal coefficient matrix at the beginning of
the solution process and multiply it with right-hand-side vectors when needed. An
additional performance gain is obtained by evaluating all rhs vectors at once and
performing a matrix-matrix multiplication instead of multiple matrix-vector
multiplications. This minor change in computation sequence yields a performance
increase from 4 Gflop/s to 250 Gflop/s during the evaluation of the derivatives by
simply providing consistent workload to the processors. The cublasSgemm function
of the CUBLAS is utilized for matrix multiplications.
Furthermore, due to the fact that coalescing appeared as a factor degreading the
performance in some parts of the code, shared memory is exploited to alleviate the
coalescing issues. Because it is on-chip, shared memory is much faster than the
global memory. In CUDA, memory loads and stores by threads of a half warps are
coalesced by the GPU into as few as one transaction when certain access
requirements are met [88].
46
Figure 5.6 : Ideal coalesced access pattern.
Figure 5.6 shows an ideal coalesced access pattern. This is the simplest case of
coalescing in which every thread accesses the corresponding memory space in an
aligned fashion. With this ideal access pattern, memory demand of an half warp of
threads can be met by only one transaction. A more comprehensive explanation of
this issue may be found in [89].
In this study, non-coalescing problem appears in the kernels that compute the right
hand side vectors of the compact and filtering scheme for the x-direction. As an
example, Figure 5.7 indicates a non-coalesced access pattern of the kernel that
computes the right hand side vectors of the compact scheme for x-derivatives. In the
figure, first half warp of any thread block is shown for the 1283 mesh case and each
different color indicates one 128 byte memory segment.
Figure 5.7 : Non-coalesced access pattern of the code.
As it may be seen from the compact scheme (3.43), each thread must read four
values from the global memory to set up right hand side vectors for the first
derivatives.
¯ℎ|ZF\ = 19 PH@N& − H@L&4ℎ Q + 149 PH@NM − H@LM2ℎ Q (5.1)
47
Accordingly, in the case of periodic boundary conditions, expression of the right
hand side value computed by the thread “0” for node “0” on the left boundary is
shown below:
¯ℎ|Z0\ = 19 PH& − H°L&4ℎ Q + 149 PHM − H°LM2ℎ Q (5.2)
Therefore, thread “0” must access considerably different memory segments to get
values H°L& and H°LMtogether with HMandH&. For the purpose of clarity, Figure 5.7
shows just those memory spaces accessed by the threads for the values of H@L&. Only
left boundary is shown in the figure but a similar situation arises for threads F = 1,F = � − 2 and F = � − 1. This problem is resolved by copying the memory
contents required by the thread blocks into the shared memory before forming right-
hand-side vectors. Here, only scheme for the first derivatives is taken into account,
actually the same problem occurs with the scheme for the second derivatives. Thus,
shared memory is also used for the computation of the second derivatives.
As a result of using the shared memory for right-hand-side assembly and some
additional minor improvements such as temporary reordering of data to be used by
CUBLAS calls, performance improvements up to 26% are realized as shown in
Table 5.2.
Table 5.2: Performance gain obtained from the aimprovements in memory access.
Total node Before
improvement (Gflop/s)
After improvement
(Gflop/s)
323 37.85 41.63
483 56.20 62.76
643 68.24 78.93
1283 98.86 124.62
49
6. APPLICATIONS AND PERFORMANCE COMPARISONS
Calculations for three basic flow problems as test cases are performed. First one is
the advection of a vortical disturbance, which is an inviscid flow problem and is
taken as an example showing the capability of the compact scheme to advect vortical
structures accurately. The second one, an example of unsteady and viscous flow, is
the temporal mixing layer. And the last one, flow simulation around a square
cylinder, is an application of IBM. Single precision is used for first and second test
cases and as mentioned before, a sixth-order compact scheme and a tenth-order
filtering scheme are applied. Though, in the third test case, a mixed-precision is used.
That is, single precision is generally applied in the code but double precision is also
exploited for the solution of Poisson equation during the iterations of conjuge
gradient method. And sixth-order compact schemes for the first and second
derivatives are utilized for the last test case. For the first and second test cases,
performance of the GPU solution in terms of calculation time per time step is
compared with a single core CPU (AMD Phenom 2.5 GHz) solution. The CPU code
using the same algorithm was written in FORTRAN and IFORT compiler optimized
version of the code is used for the comparisons. Major difference between the two
implementations is in the way of handling cyclic tridiagonal matrices. The CPU code
uses LAPACK/BLAS library to solve cyclic tridiagonal matrices efficiently on a
single processor, whereas the GPU code inverts and stores the coefficient matrix in
the beginning of the computation and repeatedly multiplies it with multiple right
hand side sets. For a tri- or penta-diagonal system on a CPU, this is an inefficient
method with extra computing load, but on a GPU it is efficiently processed compared
to implementations of tri- or pentadiagonal algorithms, especially for matrices with
periodic boundary conditions. In order to make a fair comparison of the computing
performance, appropriate algorithm for linear system solution is implemented for
each computing platform.
50
6.1 Advection of a Vortical Disturbance
This is a two-dimensional, unsteady and inviscid flow problem [14]. In this test case,
a vortex is initially defined and expected to advect with the free stream velocity
along the flow field. It is used as a test case in order to see the ability of the high-
order compact scheme in convecting the vortex accurately. As the initial condition, a
vortex centered about Z�±, �±\ is defined in the flow domain with the following
Cartesian velocity components:
= �R − ²Z� − �±\�& ��-−¯&2 ,� = ²Z� − �±\�& ��-−¯&2
¯& = Z� − �±\& + Z� − �±\&�& �R = 34] |⁄
(6.1)
Here, R equals to 1 as the core radius of the vortex and the vortex strength parameter
² Z�R�\⁄ is taken as 0.02. 2D advection equations are solved on a uniform
Cartesian mesh employing three different levels of resolution (∆0³ = ∆1³ =0.2, 0.1, 0.05). A representative picture of the flow domain may be seen in Figure
6.1.
Figure 6.1 : Flow domain.
51
The comparison between the exact and the calculated values in terms of vorticity
magnitude along the horizontal centerline is denoted in Figure 6.2. The comparison
is performed at the instant when the vortex is convected a distance of 6R.
Figure 6.2 : Comparison of the computed vorticity magnitude with exact solution.
Furthermore, Table 6.1 shows the maximum error in the computed vorticity along
the horizontal centerline. It may be seen from the values that the order of accuracy
for the compact scheme, 5.7, is close to the expected formal accuracy value.
Table 6.1: Absolute errors for the GPU calculations.
∆� Error
0.2 3.5x10-2
0.1 6.6x10-4
0.05 1.2x10-5
Obtained GPU solutions are also compared with the CPU results in terms of
calculation time and these comparisons are presented in Figure 6.3 and Figure 6.4.
-0.30
-0.10
0.10
0.30
0.50
0.70
0.90
1.10
1.30
1.50
4 6 8 10 12 14 16
ω
x/R
Exact
Compact
52
Figure 6.3 : Performance comparison, calculation time for one time step.
Figure 6.4 : Performance comparison, speed up values.
Moreover, Figure 6.5 indicates the time taken per one node for one time step.
0.01
0.047
0.2
0.001 0.005
0.02
0.00
0.05
0.10
0.15
0.20
0.25
0 50 100 150 200 250 300 350
Tim
e/tim
e st
ep (
s)
Grid size (N)
CPU vs. GPU
CPU
GPU
9
9.4
10
8.5
9
9.5
10
10.5
80 160 320
Spe
ed u
p
Grid size (N)
Speed up over CPU
53
Figure 6.5 : Performance comparison, time taken per node.
6.2 Temporal Mixing Layer
This is a three dimensional, unsteady, viscous flow problem with periodic boundary
conditions as in the previous example. Calculations are realized in flow domain with
the dimensions 2DxDxD in x,y, and z Cartesian coordinates respectively. Generally,
this flow case involves an initial condition of a hyperbolic tangent velocity profile at
t = 0. In addition to this velocity profile, a white-noise superimposed on the initial
velocity profile gives rise to small perturbations in order to form the Kelvin-Helmotz
eddies [90, 91]. Initial velocity profile is given below in (6.2). Here � and C@ are free
stream velocity and initial shear layer thickness respectively.
ZO\ = ��iGℎ 2OC@ � = 28] |⁄ C@ = 128 (6.2)
Expression for the vorticity thickness at any time is given by Lesieur et al. [90] as:
C = 2�Z� ́ �O\⁄ Vµ0 (6.3)
A superimposed white-noise perturbation which triggers the Kevin-Helmotz
instabilities may be defined as:
1.56
1.84
1.95
0.16 0.200.20
0.00
0.40
0.80
1.20
1.60
2.00
0 2 4 6 8 10 12
Tim
e pe
r no
de (
x10-6
)
Total node number (x104)
Time per node for CPU and GPU
CPU
GPU
54
¶ = 0.1��L62·98�@¸0 (6.4)
Here d is the most amplified mode and defined as d = I.nn· [92]. Figure 6.6 shows
the flow domain related to this case.
Figure 6.6 : Mixing layer flow domain.
For this problem, 3D Navier-Stokes equations are solved. Figure 6.7 shows the
temporal development of four eddies obtained from the GPU calculations. These
plots are shown on an xz-plane placed midway along the y-direction. Starting from
t = 0, cores of eddies appear, then four eddies become apparent at t = 15δ U⁄ .
Pairing of eddies may be observed starting from t = 40δ U⁄ .
Again, performance of the GPU implementation in terms of calculation time per time
step is compared with the CPU code as shown in Figure 6.8 and corresponding speed
up values are given in Figure 6.9.
U
U
55
t = 0
t = 7.5δ U⁄
t = 15δ U⁄
t = 22.5δ U⁄ t = 30δ U⁄
t = 40δ U⁄
Figure 6.7 : Evolution of eddies.
Table 6.2: Required Gflop per time step.
Grid size (N) CPU(Gflop/time-step)
GPU(Gflop/time-step)
32 0.2 0.41
48 0.65 1.88
64 1.56 5.59
128 12.5 81.0
It may be seen in Figure 6.9 that, as the computational mesh size increases, speedup
of GPU over CPU approaches a limit. This is attributed to the essentially inefficient
solution algorithm of the cyclic tridiagonal linear system on the GPU. In this respect,
Table 6.2 gives the required total number of floating point operations per time step
for different meshes on the CPU and the GPU. The CPU code, optimized to run on a
single CPU requires 12.5 Gflops per time-step of computation for the finest mesh
whereas the GPU implementation requires 81 Gflops per time-step for the same
problem size. Nevertheless the GPU implementation is 16.5 times faster than the
CPU implementation. Because the application comprises of many linear system
solutions, CPU implementation using LAPACK/BLAS library needs fewer
operations to obtain the results and the total operation count needed by the CPU for
one time step increases proportional to the N3, whereas it increases proportional to
N4 for the GPU implementation. Therefore, as the mesh gets bigger, this difference
becomes more dominant and speedup is limited.
56
Figure 6.8 : Performance comparison, calculation time for one time step.
And Figure 6.10, again, shows the time taken per node for one time step.
Figure 6.9 : Performance comparison, speed up values.
0.1450.48
1.14
10.75
0.01 0.030.07
0.65
0
2
4
6
8
10
12
0 20 40 60 80 100 120 140
Tim
e/tim
e st
ep (
s)
Grid size (N)
CPU vs. GPU
CPU
GPU
14.5
16.0
16.3
16.5
13.0
13.5
14.0
14.5
15.0
15.5
16.0
16.5
17.0
32 48 64 128
Spe
ed u
p
Grid size (N)
Speed up over CPU
57
Figure 6.10 : Performance comparison, time taken per node.
Moreover, Figure 6.11 shows the timing breakdown of the GPU code for one time
step in the case of the 1283 mesh. It is clear that solutions of linear systems constitute
the most time consuming part in the code. In addition, since the filtering scheme also
generates cyclic tridiagonal systems, a remarkable portion of the filtering part
requires solutions of similar cyclic tri-diagonal systems. This emphasizes the need
for an efficient cyclic tridiagonal solver which may lead to much higher speedups of
GPU codes over a single CPU.
Figure 6.11 : Timing breakdown of one time step for 1283 mesh on GPU.
44.25
43.40
43.49 51.26
3.05 2.71 2.67 3.10
0.00
10.00
20.00
30.00
40.00
50.00
0.00 0.50 1.00 1.50 2.00
Tim
e p
er
no
de
(x
10
-7)
Total node number (x106)
Time per node for CPU and GPU
CPU
GPU
37%
20%
18%
12%
12%
Linear systems
Right sides
Solution vec.
Filtering
Other kernels
58
6.3 IBM Test Case - Flow Around a Square Cylinder
Fluid flow around bluff bodies is one of the most encountered flow problems in
various engineering fields. It has many practical applications in terms of CFD. One
of the main characteristics of this type of flows is the flow separation taking place in
the considerable portion of the surface. Thus, in turn, it leads to a remarkable wake
region which is characterized mainly by the shedding vortices from upper and lower
sides of the body. Depending on the flow system, this wake region may presents
some benefits like enhanced heat and mass transfer, but at the same time it may
cause some tiresome issues such as unwanted vibrations. Therefore, predicting the
wake region accurately is quite important. Most of the researches focused on the
flows around circular cylinder in the past [93]. Another important bluff body is
square cylinder. The flow around a square cylinder is similar to the circular cylinder
case in some ways. Nonetheless, it has also important differences in comparison with
circular cylinder. The square cylinder has fixed seperation points on either leading or
trailing edges in contrast to the circular cylinder. Moreover, generated wake
immediately after the square cylinder is wider than that of the circular cylinder. As a
result of this, the region covered by formation of the Karman vortices is much longer
and broader in the square cylinder case. Various flow regimes are encountered in the
flow past a square cylinder depending on the Reynolds number that is defined in
terms of free stream velocity, cylinder dimension and kinematic viscosity. If the
Reynolds number is below or about the unity, then no seperation is observed at the
surface of the cylinder. While the Reynolds number increases, the flow starts to
seperate first at the trailing edges and constitutes a closed, steady recirculation region
immediately behind the cylinder. If the Reynolds number is increased further, it
reaches a critical value (Rec) in which a periodic vortex shedding known as von
Karman vortex street appears in the wake region. Researchers give different values
for the critical reynolds number. Okajima [94] has performed an experimental study
for the rectangular cylinders and specifies an approximate value (Rec ≈ 70) for the
critical Reynolds number. Another experimetal study may be found in . Other values
given for the Rec are 53 in [95], 51 in [96]. Transition from the steady state to the
unsteady state is realized at this Reynolds number. Then if the Reynolds number is
further increased, the flow around a square cylinder makes a transition to three
dimensionality beyond a certain critical Reynolds number. There is not an exact
59
value for this critical Reynolds number, rather generally an interval is given for this
Reynolds number. In [97] it is said that transition to three dimensionality in the wake
of a square cylinder takes place in the limit between 150 and 175 of Reynolds
number. Similarly, [98] also gives the critical Reynolds number for the transition to
three dimesionality as �� = 162 ∓ 12.
In this study, simulation of the flow around a square cylinder is applied in order to
evaluate the IBM on GPU for the Reynolds numbers between 100 and 300. As
mentioned in the section involving the used numerical methods throughout this
thesis, sixth-order compact central finite difference schemes are utilized for both first
and second spatial derivatives for the discretization of the incompressible Navier-
Stokes equations.
Flow domain of this test case and applied boundary conditions are given in Figure
6.12. The square cylinder is placed between � = 6, � = 7iG�� = 5.85, � = 6.85.
Side length of the cylinder is » = 1, and its height is taken as 3.875». Values of the
dimensions in streamwise, stream-normal and spanwise directions are given in Table
6.3
Figure 6.12 : Immersed boundary flow domain and boundary conditions.
Table 6.3 : Dimensions of the flow domain.
X1/D Y1/D Z1/D
25.5 12.7 3.875
60
At the inlet, applied boundary conditions are = �R, � = # = 0 and �- ��⁄ = 0.
At the outlet, boundary conditions are � ��⁄ = �� ��⁄ = �# ��⁄ = 0 and - = 0.
Moreover, periodic boundary conditions are imposed to the crosswise and spanwise
boundaries.
Uniform grids are used for the computations of this test case. Total node numbers
along each Cartesian coordinate (x, y, z) are 384x192x32 nodes respectively. In
vortex dominated flow fields, It is quite important to calculate not only the region
near the body but also the wake region accurately. This is also another factor for the
usage of high-order compact scheme that has high accuracy and low numerical
dissipation in the region far from the immersed boundary. In this respect, using high-
order compact scheme also gives an opportunity to apply the uniform grid
throughout the flow domain. As a result, generated grid becomes coarser in the
vicinity of immersed boundary but fine enough to reach accurate results in terms of
whole flow domain. In Table 6.4 below, ∆� values which shows the first node
distance from the solid body for various studies in literature and present study are
indicated. The table also gives numerical methods used for the mentioned studies.
Table 6.4 : First node distance from the body.
Studies ∆� Method
Sohankar et al. [96] 0.004 Finite volume method.
2nd-3rd order in space.
Sharma and Eswaran [99] 0.01 Finite volume method.
2nd order in space.
Sahu et al. [100] 0.01 Finite volume method.
2nd order in space.
Singh et al. [101] 0.007 Finite volume method.
2nd order in space.
Sen et al. [102] 0.001 Finite element method
Present 0.066
61
As it is seen on the table, distance of the first node to the body in this study is at least
five times higher in comparison to others.
In the context of this study, Figure 6.13 shows the instantaneous streamlines and the
velocity contours of the flow domain for �� = 100. Form of the antisymmetric
wake flow, called as von Karman vortex street, can be seen evidently. As may be
seen in Figure 6.13, with the purpose of describing the characteristic of immersed
boundary method, these illustrations do not involve a square cylinder in the flow
domain as an object, but flow behaves just like there is one.
a
b
Figure 6.13 : a) Instantaneous streamlines, b) � velocity contours of the flow domain for Re =100.
Moreover, Figure 6.14, through a-h, depicts instantaneous streamlines in the vicinity
of square cylinder for a full vortex shedding cycle at �� = 100. In Figure 6.14, there
are also coreesponding images below the each phase of the cycle, showing the results
of another computational study [99]. With its eight consecutive frames Figure 6.14
covers a vortex shedding period. Centers and saddles related to vortex shedding are
also indicated in the figure. In addition, the process of instantaneous ‘alleyways’
mentioned in [103] is also demonsrated in the figure. In this process, fluid is sucked
62
into the recirculation region from both top and bottom of the cylinder succesively
through the “alleyways”.
a b
c d
Figure 6.14 : Instantaneous streamlines in the vicinity of square cylinder for a full avortex shedding cycle, Re = 100 (above: Present study, below: Sharma aand Eswaran [99]).
63
e f
g h
Figure 6.14 (continued) : Instantaneous streamlines in the vicinity of square acylinder for a full vortex shedding cycle, Re = 100 a(above: Present study, below: Sharma and Eswaran [99]).
As an example two of these alleyways are shown by the red arrows in Figure 6.14
b,e. For instance, looking at the figure, it can be seen that, the period begins with a
vortex maturing and shedding from lower side of the cylinder. At the same time,
during this interval, fluid is constantly sucked into the recirculation region from the
upper side . After the vortex sheds, then the same process is starting again but with
the opposite sides. This time vortex matures and sheds from the upper side of the
cylinder, and the fluid is sucked into the recirculation region from the lower side.
64
In addition, Figure 6.15 a-h gives the u velocity contours(above) and corresponding
pressure contours (below) for the same cycle presented in Figure 6.14.
Figure 6.15 : U velocity and pressure contours for a full vortex shedding cycle, Re = 100.
a b
c d
f e
65
Figure 6.15 (continued) : U velocity and pressure contours for a full vortex ishedding cycle, Re = 100.
Moreover, Figure 6.16 shows a comparison of the recirculation length defined as the
streamwise distance from the base of the square cylinder to the re-attachment point
on the wake centerline.
Figure 6.16 : Comparison of Lr. (Yoon et al. [104], Sharma and Eswaran [99], aRobichaux et al. [98])
1
1.5
2
90 100 110 120 130 140 150 160 170 180 190 200 210
Lr/
D
Re
Yoon et al. (2010)
Sharma and Eswaran (2004)
Robichaux et al. (1999)
Present
g h
66
The figure shows that the obtained recirculation lengths are in good agreement with
the results of other computational studies [98, 99, 104].
Strouhal number is also an important non-dimensional parameter giving the
information about the vortex shedding frequency of the flow and defined as:
¢� = H»�R (6.5)
Here, H, »and �Rstand for shedding frequency, cylinder dimension and free stream
velocity respectively. Shedding frequency may be obtained by appliying fast fourier
transform (FFT) to the lift coefficient or drag coefficient signals. Figure 6.17, as a
result of a FFT implementation, shows the power spectrum of Cl and Cd respectively.
As the figure indicates, while lift coefficient oscillates at the shedding frequency,
drag coefficient oscillates with twice the shedding frequency.
Figure 6.17 : Power spectrum of Cl (left) and Cd (right) for Re = 100.
In addition, Table 6.5 shows a comparison made between various previous studies
and the present one in terms of Strouhal number for �� = 100. As it may be seen
from the table, obtained value for St is quite consistent with other studies. The
biggest difference between the present value and the others is less than 1.5%. As an
addition to these Table 6.6 and Figure 6.18 indicate the comparisons of different
studies in terms of variation of Strouhal number with Reynolds number. It is
observed in the figure that obtained Strouhal number values in this study play along
with the results of other studies. If we look at the Figure 6.18 in a broad sense, it can
be seen that while the St values for �� = 100 and �� = 150 are clustered in a
narrower band, the values for the Reynolds number higher than 150 exhibit a more
discrete behaviour.
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
P
f
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
P
f
67
Table 6.5: Comparison of Strouhal numbers ifor �� = 100.
Studies St
Sohankar et al. [96] 0.146
Sharma and Eswaran [99] 0.148
Sahu et al. [100] 0.148
Singh et al. [101] 0.147
Sen et al. [102] 0.145
Present 0.146
Table 6.6: Comparison for the variation of Strouhal number with Re.
Re = 100 Re = 150 Re = 200 Re = 250 Re = 300
Okajima (exp.) [94] 0.142 0.143 0.145 0.144 0.145
Sen et al. [102] 0.145 0.160
Singh et al. [101] 0.147 0.156 0.143 0.133
Sharma and Esweran [99] 0.148 0.160
Saha et al. [97] 0.15 0.165 0.161 0.146 0.130
Sohankar et al. [105] 0.165 0.160 0.159 0.153
Robichaux et al. [98] 0.164 0.155
Sohankar et al. [96] 0.146 0.165
Franke et al. [106] 0.157 0.146 0.130
Davis et al. [107] 0.164 0.176
Present 0.146 0.161 0.158 0.156 0.155
68
In this context, it may be useful to remember that the flow around a square cylinder
makes a transition to three dimensionality beyond a certain critical Reynolds number.
There is not an exact value for this critical Reynolds number, rather generally an
interval is given for this Reynolds number. In [97] it is said that transition to three
dimensionality in the wake of a square cylinder takes place in the limit between 150
and 175 of Reynolds number. Similarly, [98] also gives the critical Reynolds number
for the transition to three dimesionality as �� = 162 ∓ 12. Therefore, three
dimensional effects may give some hints about the discreteness of St values for the
Re values higher than 150 in Figure 6.18.
Figure 6.18 : Variation of Strouhal number with Reynolds number. Mentioned studies: Okajima [94], Sen et al. [102], Singh et al. [101], Sharma and Eswaran [99], Saha et al. [97], Sohankar et al. [105], Robichaux et al. [98], Sohankar et al. [96], Franke et al. [106], Davis et al. [107].
Furthermore, because the no-slip boundary condition is not enforced strictly on the
body during the solution process, this condition is satisfied approximately.
Consequently, penetration of some streamlines into the immersed boundary may
occur as shown in Figure 6.14. As a result of this, although the wake region and
vortex shedding are simulated accurately, this issue somewhat deteriorates the force
calculations. Therefore, RMS of lift coefficients and time averaged drag coefficients
may be seen in Table 6.7 together with the other studies for Re = 100. It may also be
0.10
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
50 100 150 200 250 300 350
St
Re
Re vs. St
Okajima (experiment)
Sen et al.
Singh et al.
Sharma and Esweran
Saha et al.
Sohankar et al. (1999)
Robichaux et al.
Sohankar et al. (1998)
Franke et al.
Davis et al.
Present 3D
69
seen from the table Clrms value deviates much more than Cdave value. While Clrms
differs from the nearest value by more than 25%, this ratio for the Cdave is about 5%.
The same trend is observed for the various Reynolds numbers, those results are
indicated in Figure 6.19 and Figure 6.20. Likewise, drag coefficient exhibits the
similar pattern with other studies, whereas lift coefficient indicates more discrepancy
in comparison with drag coefficient.
Table 6.7: Comparison of Clrms and Cdave.
Study Clrms Cdave
Sharma and Eswaran [99] 0.19 1.49
Sahu et al. [100] 0.19 1.50
Singh et al. [101] 0.16 1.52
Sen et al. [102] 0.19 1.53
Saha et al. [97] 0.12 1.50
Present 0.26 1.41
Figure 6.19 : RMS of lift coefficients for the various Reynolds numbers.
0
0.2
0.4
0.6
0.8
1
50 100 150 200 250 300
CL r
ms
Re
Present
Sharma and Eswaran
Sahu et al.
Singh et al.
Sen et al.
Saha et al.
Sohankar et al.
70
Figure 6.20 : Time averaged drag coefficients for the various Reynolds numbers.
Generally speaking, solution process of the Poisson equation takes the vast majority
of the running time for one time step. Therefore, performance of the solution process
of this equation affects directly the overall performance of the incompressible flow
solver. Figure 6.21 shows the timing breakdown of one time step in the solution of
incompressible Navier-Stokes equations for the mesh consisting of 2x106 nodes in
total.
Figure 6.21 : Timing breakdown of one time step.
1.3
1.4
1.5
1.6
1.7
1.8
50 100 150 200 250 300
Cd a
ve
Re
Present
Sharma and Eswaran
Sahu et al.
Singh et al.
Sen et al.
Saha et al.
Sohankar et al.
Robichaux et al.
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Poisson
Solution
Other parts
71
It may be obviously realized that, Poisson equation solution is the most decisive part
of all in terms of general performance. If we go further in analyzing the timing
breakdown of the relevant parts of the code, we can come up with the Figure 6.22.
Figure 6.22 : Timing breakdown of CG iterative solver.
This figure, like the previous one, also indicates a timing breakdown of a certain part
in the code in which conjugate gradient iterative solver takes place. The figure shows
time consuming percentages of each part for one iteration. It may also noted from the
figure that sparse matrix-vector multiplication is the most time consuming section of
the iterative process (≈75%). Currently, there is also a CUDA library in order to
handle sparse matrix and vector operations on CUDA enabled devices. Name of the
library is CUSPARSE [108]. Therefore one alternative is to utilize this library to
make the sparse matrix-vector multiplications in the solution process of Poisson
equation and the result indicated in Figure 6.22 is obtained by utilizing the
CUSPARSE library. In this respect, Figure 6.23 presents sparse matrix-vector
multiplication performance of CUSPARSE library.
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
SpMV
daxpy
ddot
dscal
72
Figure 6.23 : SpMV performance of CUSPARSE library.
These values in the figure are obtained for a sparse matrix of standard 7 point
Poisson stencil with different total node numbers. CUSPARSE library can use
different sparse matrix format, such as coordinate (COO), compressed sparse row
(CSR) and compressed sparse column format (CSC). Presented values in Figure 6.23
are obtained by using CSR format. Instead of CSR format, a different storage format
also may be applied. In this study, ELLPACK format explained in section 3.5 is
exploited for the sparse matrix-vector multiplication. Written custom kernel
involving ELLPACK gives considerably better performance in comparison with
previously mentioned CUSPARSE library. Figure 6.24 shows this comparison for
the sparse matrix-vector multiplication for the cases with different total nodes. As it
may be seen in the figure ELLPACK format based kernel achieves approximately
3.5x speed ups over CUSPARSE library for the different grids.
In addition, constituted conjugate gradient iterative solver which uses ELLPACK
format is compared with the performance values in another GPU application from
literature [60]. In [60], conjugate gradient solver of CUSP library is used for the
solution of sparse linear systems, and a performance comparison between a GPU
(CUSP) and a CPU implementation is presented for the solution of sparse linear
systems.
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
0.016
0.018
0 1 2 3 4
Tim
e (
s)
Total node number (x106)
SpMV with CUSPARSE
73
Figure 6.24 : Performance comparison of CUSPARSE and ELLPACK.
CUSP is an open-source library for sparse linear algebra and graph computations on
CUDA [109]. Used CPU and GPU in that study are Intel Xeon X5650 and Tesla
C2050 respectively. Figure 6.25 shows this comparison for the solution of sparse
linear systems. Systems in this comparison are formed by the use of general 5-point
Poisson stencil. As it also may be seen in the figure present application of conjugate
gradient solver gives approximately 3 times better performance in comparison with
the application of CUSP library on the GPU. As a consequence of this performance
improvement, weight of the sparse matrix-vector multiplication in the conjugate
gradient algorithm decreases considerably. The final status of the timing breakdowns
may be seen in Figure 6.26. Figure 6.26 a and b shows the timing breakdowns of
incompressible flow solver and conjugate gradient iterative solver respectively.
Although, there is a reduction in percentage of the Poisson solver, it still has a share
of 86% and dominates the solution time heavily. Therefore, any further improvement
in the solution of Poisson equation can affect the performance of the GPU directly.
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
0.016
0.018
0 1 2 3 4
Tim
e (
s)
Total node number (x106)
CUSPARSE
ELLPACK
74
Figure 6.25 : Time comparison for the solution of sparse linear systems.
a
b
Figure 6.26 : Timing breakdown of incompressible solver and CG iterative solver after the use of ELLPACK form.
0
50
100
150
200
250
300
350
400
0.5 1 1.5 2 2.5 3 3.5 4
Tim
e (s
)
Total node number (x106)
CPU
GPU(cusp)
GPU (ELLPACK)
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Poisson
Solution
Other parts
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
SpMV
daxpy
ddot
dscal
75
7. CONCLUSIONS
Purpose of this study is to test and apply some numerical tools, used in CFD, to GPU
that is coming to the forefront as a powerful computing device in recent years.
Applied numerical tools, a sixth-order compact scheme, a tenth-order filtering and
immersed boundary method, also attract a great deal of attention for the flow
simulations involving complex features and turbulent regime. A successful
incorporation of effective numerical methods and competent computing machines,
like GPU, may give rise to impressive progress in CFD.
With this object in mind, several applications are realized on the GPU in order to
analyze the real potential of this combination. Uniform grids are used for all test
cases. In vortex dominated flow fields, it is quite important to calculate not only the
region near the body but also the wake region accurately. This is also another factor
for the usage of high-order compact scheme that has high accuracy and low
numerical dissipation in the region far from the boudaries. In this respect, using high-
order compact scheme also gives an opportunity to apply the uniform grid
throughout the flow domain. As a result, generated grid becomes coarser in
comparison with the standard grid stretching approach in the vicinity of immersed
boundary but fine enough to reach accurate results in terms of whole flow domain.
Therefore, in first place, implementation of a high-order compact finite difference
scheme on a GPU is performed for the solution of some basic fluid dynamics
problems. A sixth-order compact central scheme and a tenth-order filtering scheme
are utilized in the solution of two example problems: advection of a vortical
disturbance and the temporal mixing layer. As a result of periodic boundary
conditions applied for both problems, the compact scheme and the filtering scheme
generate cyclic tridiagonal linear systems. In the CPU implementation which is
compared with the GPU code, these linear systems are solved efficiently by using
LAPACK/BLAS library, whereas the GPU code utilizes the inverse of the coefficient
matrix to solve the same linear system, resulting a faster but relatively inefficient
computation. Therefore, as the mesh size gets bigger, allocated time for solutions of
76
the linear systems increases remarkably for the GPU implementation. In this context,
for the finest mesh, 42% of the operating time is consumed to solve cyclic tridiagonal
linear systems. This shows that an improved solver for the solution of these relatively
small sized cyclic tridiagonal systems would further increase speedup of the GPU
implementation over the CPU implementation. In addition, shared memory which
has much lower latency values than the global memory is utilized in order to avoid
non-coalescing memory accessing. Currently, speedups between 9x–16.5x are
achieved on GPU compared to CPU computations.
It is concluded that, compact-finite-difference schemes on uniform Cartesian meshes
for the solution of CFD problems can achieve excellent speedup when implemented
on a GPU. As they require repeated solution of a relatively small cyclic tridiagonal
system with constant coefficients and multiple right-hand sides, further speedup is
possible with an appropriate solver for the specific structure of the linear system.
Furthermore, another test case involving immersed boundary method is also utilized.
Incompressible Navier-Stokes equations are solved on a Cartesian mesh with the
purpose of simulating the flow around a square cylinder. A direct forcing approach is
utilized for the immersed boundary method applied in this problem. In this approach,
the forcing term is inserted to the momentum equation in order to ensure the proper
boundary velocity value on the surface of immersed body. Again, a sixth-order
compact central scheme is used for the discretization of spatial derivatives and the
solution procedures related to compact scheme is the same as described shortly
before. Time advancement is ensured with the use of second-order Adams-Bashforth
scheme. Besides, a conjugate gradient iterative method is incorporated on GPU for
the solution of Poisson equation. In this context, two different approach are tested
for sparse matrix-vector multiplication which is the most decisive process in the
conjugate gradient iterative solver. First approach tested includes CUSPARSE sparse
matrix and vector operations library that utilizes the CSR format for the sparse
matrix. The second approach is a written custom kernel which implements
ELLPACK format for the saparse matrix. As a result, it is found that ELLPACK
format is more efficient for the problem dealt with. In addition to this, it provides an
opportunity for accessing the memory in a coalesced way. Within this scope, one
thing should also be added that shared memory is exploited for conjugate gradient
method in sparse-matrix vector multiplication too. Immersed boundary effects are
77
provided by inserting a force term directly to the momentum equation. Thus, with
this suitably designed direct forcing term, a square cylinder is represented on a
Cartesian mesh. Results are obtained for the Reynolds number range of 100-300.
Obtained streamline structure for one cycle of vortex shedding at �� = 100 is
presented to show the capability of the method to simulate the wake region of the
cylinder. Moreover, comparisons with other studies are exhibited in terms of
Strouhal numbers for various Reynolds numbers. It is seen that present
implementation simulates the wake region and vortex shedding accurately, but is not
very successful at estimating the force coefficients. Due to approximately satisfied
no-slip boundary condition, stremlines may penetrate into the immersed body. In
turn, this causes to degraded results especially for the lift coefficient. Furthermore, a
comparison is made regarding the performance of the present GPU application. Since
solving the sparse linear systems efficiently is the most decisive factor in terms of the
performance, a comparison is performed between present implementation and other
GPU and CPU applications for the sparse linear system solutions. As a consequence,
about 3x and 20x speedups are achieved in comparison with GPU and CPU
respectively.
Generally, GPUs obviously have pretty much potential as parallel computing
devices. With accordingly coded algorithms and the advancements in numerical tools
that can exploit the GPUs strength effectively, they may become the most influential
assistants of researchers in dealing with large, complex fluid flow problems.
7.1 Future Works
Some further improvements in terms of obtained results and performance can be
achieved by incorporating some future works in to the present one. First of all,
GPGPU market develops steadily, brand-new GPUs come onto the market just every
several years. Even only with one of the latest GPU, one can reach higher
performance values by doing slight modifications to constituted code here. Apart
from this, future work may include the followings:
- Necessary modifications for porting the code to multi-GPU environment.
- Constitution of more efficient lineer solvers for cyclic tridiagonal systems.
78
- Incorporation of a spectral solver instead of conjugate gradient approach for
Poisson equation.
- Grid clustering in the vicinity of an immersed boundary for more accuracy.
79
REFERENCES
[1] Lele, S. K. (1992). Compact finite-difference schemes with spectral-like resolution, Journal of Computational Physics, 103(1), 16-42.
[2] Visbal, M. R. & Gaitonde, D. V. (2002). On the use of higher-order finite-difference schemes on curvilinear and deforming meshes, Journal of Computational Physics, 181(1), 155-185.
[3] Hirsh, R. S. (1975). Higher-order accurate difference solutions of fluid mechanics problems by a compact differencing technique, Journal of Computational Physics, 19(1), 90-109.
[4] Weinan, E. & Liu, J. G. (1996). Essentially compact schemes for unsteady viscous incompressible flows. Journal of Computational Physics, 126(1), 122-138.
[5] Gamet, L., Ducros, F., Nicoud, F. & Poinsot, T. (1999). Compact finite difference schemes on non-uniform meshes. Application to direct numerical simulations of compressible flows, International Journal for Numerical Methods in Fluids, 29(2), 159-191.
[6] Mahesh, K. (1998). A family of high order finite difference schemes with good spectral resolution, Journal of Computational Physics, 145(1), 332-358.
[7] Meitz, H. L. & Fasel, H. F. (2000). A compact-difference scheme for the Navier-Stokes equations in vorticity-velocity formulation, Journal of Computational Physics, 157(1), 371-403.
[8] Demuren, A. O., Wilson, R. V. & Carpenter, M. (2001). Higher-order compact schemes for numerical simulation of incompressible flows, part I: Theoretical development, Numerical Heat Transfer Part B-Fundamentals, 39(3), 207-230.
[9] Meinke, M., Schroder, W., Krause, E. & Rister, T. (2002). A comparison of second- and sixth-order methods for large-eddy simulations, Computers & Fluids, 31(4-7), 695-718.
[10] Nagarajan, S., Lele, S. K. & Ferziger, J. H. (2003). A robust high-order compact method for large eddy simulation, Journal of Computational Physics, 191(2), 392-419.
[11] Shukla, R. K., Tatineni, M. & Zhong, X. L. (2007). Very high-order compact finite difference schemes on non-uniform grids for incompressible Navier-Stokes equations, Journal of Computational Physics, 224(2), 1064-1094.
[12] Rizzetta, D. P., Visbal, M. R. & Morgan, P. E. (2008). A high-order compact finite-difference scheme for large-eddy simulation of active flow control, Progress in Aerospace Sciences, 44(6), 397-426.
80
[13] Boersma, B. J. (2011). A 6th order staggered compact finite difference method for the incompressible Navier-Stokes and scalar transport equations, Journal of Computational Physics, 230(12), 4940-4954.
[14] Visbal, M. R. & Gaitonde, D. V. (1999). High-order-accurate methods for complex unsteady subsonic flows, Aiaa Journal, 37(10), 1231-1239.
[15] Mittal, R. & Iaccarino, G. (2005). Immersed boundary methods, Annual Review of Fluid Mechanics, 37, 239-261.
[16] Peskin, C.S. (2002). The immersed boundary method, Acta Numerica, 11, 479-517.
[17] Peskin, C. S. (1972). Flow Patterns around heart valves - numerical method, Journal of Computational Physics, 10(2), 252-271.
[18] Liao, C. C., Chang, Y. W., Lin, C. A. & McDonough, J. M. (2010). Simulating flows with moving rigid boundary using immersed-boundary method, Computers & Fluids, 39(1), 152-167.
[19] Lei, Y. X., Wei, H. G. & Xing, Z. (2010). Large-eddy simulation of flows past a flapping airfoil using immersed boundary method, Science China-Physics Mechanics & Astronomy, 53(6), 1101-1108.
[20] Paravento, F., Pourquie, M. J. & Boersma, B. J. (2008). An immersed boundary method for complex flow and heat transfer, Flow Turbulence and Combustion, 80(2), 187-206.
[21] Seo, J. H. & Mittal, R. (2011). A high-order immersed boundary method for acoustic wave scattering and low-Mach number flow-induced sound in complex geometries, Journal of Computational Physics, 230(4), 1000-1019.
[22] Lai, M. C. & Peskin, C. S. (2000). An immersed boundary method with formal second-order accuracy and reduced numerical viscosity, Journal of Computational Physics, 160(2), 705-719.
[23] Goldstein, D., Handler, R. & Sirovich, L. (1993). Modeling a no-slip flow boundary with an external force-field, Journal of Computational Physics, 105(2), 354-366.
[24] Saiki, E. M. & Biringen, S. (1996). Numerical simulation of a cylinder in uniform flow: Application of a virtual boundary method, Journal of Computational Physics, 123(2), 450-465.
[25] Mohd-Yusof, J. (1997). Combined immersed boundaries/b-splines methods for simulations of flows in complex geometries. NASA Ames/Stanford University: CTR Annual Research Briefs.
[26] Fadlun, E. A., Verzicco, R., Orlandi, P. & Mohd-Yusof, J. (2000). Combined immersed-boundary finite-difference methods for three-dimensional complex flow simulations, Journal of Computational Physics, 161(1), 35-60.
[27] Zhang, N. & Zheng, Z. C. (2007). An improved direct-forcing immersed-boundary method for finite difference applications, Journal of Computational Physics, 221(1), 250-268.
81
[28] Ikeno, T. & Kajishima, T. (2007). Finite-difference immersed boundary method consistent with wall conditions for incompressible turbulent flow simulations, Journal of Computational Physics, 226(2), 1485-1508.
[29] Iaccarino, G. & Verzicco, R. (2003). Immersed boundary technique for turbulent flow simulations, Applied Mechanics Reviews, 56(3), 331-347.
[30] Kim, J., Kim, D. & Choi, H. (2001). An immersed-boundary finite-volume method for simulations of flow in complex geometries, Journal of Computational Physics, 171(1), 132-150.
[31] Tseng, Y. H. & Ferziger, J. H. (2003). A ghost-cell immersed boundary method for flow in complex geometry, Journal of Computational Physics, 192(2), 593-623.
[32] Balaras, E. (2004). Modeling complex boundaries using an external force field on fixed Cartesian grids in large-eddy simulations, Computers & Fluids, 33(3), 375-404.
[33] Yang, J. M. & Balaras, E. (2006). An embedded-boundary formulation for large-eddy simulation of turbulent flows interacting with moving boundaries, Journal of Computational Physics, 215(1), 12-40.
[34] Kim, D. & Choi, H. (2006). Immersed boundary method for flow around an arbitrarily moving body, Journal of Computational Physics, 212(2), 662-680.
[35] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., …. Purcell, T. J. (2007). A survey of general-purpose computation on graphics hardware, Computer Graphics Forum, 26(1), 80-113.
[36] Hagen, T. R., Lie, K. A. & Natvig, J. R. (2006). Solving the Euler equations on graphics processing units, Proceedings of the Sixth International Conference on Computational Science, Part IV, (pp. 220-227). United Kingdom: Reading, May 28-31.
[37] Brandvik, T. & Pullan, G. (2007). Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware, Proceedings of the Institution of Mechanical Engineers Part C-Journal of Mechanical Engineering Science, 221(12), 1745-1748.
[38] Brandvik, T. & Pullan, G. (2008). Acceleration of a 3D Euler solver using commodity graphics hardware (AIAA Paper 2008-607), 46th AIAA Aerospace Sciences Meeting and Exhibit. Reno, USA: January 7-10.
[39] Elsen, E., LeGresley, P. & Darve, E. (2008). Large calculation of the flow over a hypersonic vehicle using a GPU, Journal of Computational Physics, 227(24), 10148-10161.
[40] Corrigan, A., Camelli, F. F., Lohner, R. & Wallin, J. (2011). Running unstructured grid-based CFD solvers on modern graphics hardware, International Journal for Numerical Methods in Fluids, 66(2), 221-229.
82
[41] Tolke, J. & Krafczyk, M. (2008). TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International Journal of Computational Fluid Dynamics, 22(7), 443-456.
[42] Zhao, Y. (2008). Lattice boltzmann based PDE solver on the GPU, Visual Computer, 24(5), 323-333.
[43] Che, S., Boyer, M., Meng, J. Y., Tarjan, D., Sheaffer, J. W. & Skadron, K. (2008). A performance study of general-purpose applications on graphics processors using CUDA, Journal of Parallel and Distributed Computing, 68(10), 1370-1380.
[44] Rossinelli, D. & Koumoutsakos, P. (2008). Vortex methods for incompressible flow simulations on the GPU, Visual Computer, 24(7-9), 699-708.
[45] Wang, P., Abel, T. & Kaehler, R. (2010). Adaptive mesh fluid simulations on GPU, New Astronomy, 15(7), 581-589.
[46] Rossinelli, D., Bergdorf, M., Cottet, G. H. & Koumoutsakos, P. (2010). GPU accelerated simulations of bluff body flows using vortex particle methods, Journal of Computational Physics, 229(9), 3316-3333.
[47] Kampolis, I. C., Trompoukis, X. S., Asouti, V. G. & Giannakoglou, K. C. (2010). CFD-based analysis and two-level aerodynamic optimization on graphics processing units, Computer Methods in Applied Mechanics and Engineering, 199(9-12), 712-722.
[48] Xian, W. & Takayuki, A. (2011). Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster, Parallel Computing, 37(9), 521-535.
[49] Hoffmann, K. A. & Chiang, S. T. (2000). Computational Fluid Dynamics (4th ed., Vol. 1). Vichita, USA: Engineering Education System.
[50] Ferziger, J. H. & Peric, M. (2002). Computational Methods for Fluid Dynamics. New york, USA: Springer-Verlag.
[51] Hoffmann, K. A. & Chiang, S. T. (2000). Computational Fluid Dynamics (4th ed., Vol. 3). Vichita, USA: Engineering Education System.
[52] Ekaterinaris, J. A. (2005). High-order accurate, low numerical diffusion methods for aerodynamics, Progress in Aerospace Sciences, 41(3-4), 192-300.
[53] Laizet, S. & Lamballais, E. (2009). High-order compact schemes for incompressible flows: A simple and efficient method with quasi-spectral accuracy, Journal of Computational Physics, 228(16), 5989-6015.
[54] Uzun, A., Blaisdell, G. A. & Lyrintzis, A. S. (2004). Application of compact schemes to large eddy simulation of turbulent jets, Journal of Scientific Computing, 21(3), 283-319.
[55] Gaitonde, D. V. & Visbal, M. R. (2000). Pade-type higher-older boundary filters for the Navier-Stokes equations, Aiaa Journal, 38(11), 2103-2112.
83
[56] Beyer, R. P. & Leveque, R. J. (1992). Analysis of a one-dimensional model for the immersed boundary method, Siam Journal on Numerical Analysis, 29(2), 332-364.
[57] Unverdi, S. O. & Tryggvason, G. (1992). A front-tracking method for viscous, incompressible, multi-fluid flows, Journal of Computational Physics, 100(1), 25-37.
[58] Zhu, L. D. & Peskin, C. S. (2003). Interaction of two flapping filaments in a flowing soap film, Physics of Fluids, 15(7), 1954-1960.
[59] Stockie, J. M. & Wetton, B. R. (1999). Analysis of stiffness in the immersed boundary method and implications for time-stepping schemes, Journal of Computational Physics, 154(1), 41-64.
[60] Layton, S. K., Krishnan, A. & Barba, L. A. (2011). cuIBM – A GPU-accelerated immersed boundary method, International Conference on Parallel Computational Fluid Dynamics, Barcelona, Spain: May 16-20.
[61] Verzicco, R., Mohd-Yusof, J., Orlandi, P. & Haworth, D. (2000). Large eddy simulation in complex geometric configurations using boundary body forces, Aiaa Journal, 38(3), 427-433.
[62] Ferziger, J. H. & Peric, M. (2002). Computational Methods for Fluid Dynamics. Berlin: Springer-Verlag.
[63] Bell, N. & Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors, Conference on High Performance Computing, Networking, Storage and Analysis, Portland, Oregon, USA: November 14-20.
[64] Vázquez, F., Ortega, G., Fernández, J. J. & Garzón E. M. (2010). Improving the performance of the sparse matrix vector product with GPUs, The 10th IEEE International Conference on Computer and Information Technology, Bradford, UK: June 29-July 01.
[65] Hwu, W., Keutzer, K. & Mattson, T. G. (2008). The concurrency challenge, IEEE Design and Test of Computers, 25(4), 312-320.
[66] Kirk, D. B. & Hwu, W. (2010). Programming Massively Parallel Processors. Burlington, USA: Morgan Kaufmann.
[67] Trapnell, C. & Schatz, M. C. (2009). Optimizing data intensive GPGPU computations for DNA sequence alignment, Parallel Computing, 35(8-9), 429-440.
[68] Shi, H. X., Schmidt, B., Liu, W. G. & Muller-Wittig , W. (2009). Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA, 2009 Ieee International Symposium on Parallel & Distributed Processing (Vol. 1-5, pp. 1546-1553). Rome, Italy: May 25-29
[69] Klockner, A., Warburton, T., Bridge, J. & Hesthaven, J. S. (2009). Nodal discontinuous Galerkin methods on graphics processors, Journal of Computational Physics, 228(21), 7863-7882.
84
[70] Ma, W. J. & Agrawal, G. (2009). A compiler and runtime system for enabling data mining applications on GPUs, Acm Sigplan Notices, 44(4), 287-288.
[71] Sintorn, E. & Assarsson, U. (2008). Fast parallel GPU-sorting using a hybrid algorithm, Journal of Parallel and Distributed Computing, 68(10), 1381-1388.
[72] McGraw, T. & Nadar, M. (2007). Stochastic DT-MRI connectivity mapping on the GPU, Ieee Transactions on Visualization and Computer Graphics, 13(6), 1504-1511.
[73] Stone, S. S., Haldar, J. P., Tsao, S. C., Hwu, W. M. W., …. Liang, Z. P. (2008). Accelerating advanced MRI reconstructions on GPUs, Journal of Parallel and Distributed Computing, 68(10), 1307-1318.
[74] Maintz, S., Eck, B. & Dronskowski, R. (2011). Speeding up plane-wave electronic-structure calculations using graphics-processing units, Computer Physics Communications, 182(7), 1421-1427.
[75] Harvey, M. J. & De Fabritiis, G. (2009). An Implementation of the Smooth Particle Mesh Ewald Method on GPU Hardware, Journal of Chemical Theory and Computation, 5(9), 2371-2377.
[76] Harris, M. J. (2004) Fast Fluid Dynamics Simulation on the GPU, in R. Fernando (editor), GPU Gems (pp. 637-665). Boston: Pearson Education.
[77] Crane, K., Llamas, I. & Tariq, S. (2007) Real-time Simulation and Rendering of 3D Fluids, in H. Nguyen (editor), GPU Gems 3 (pp. 633-675). Boston: Pearson Education.
[78] Göddeke, D., Strzodka, R., Mohd-Yusof, J., McCormick, P., …. Turek, S. (2008). Using GPUs to improve multigrid solver performance on a cluster, International journal of Computational Science and Engineering, 4(1), 36-55.
[79] Kuznik, F., Obrecht, C., Rusaouen, G. & Roux, J. J. (2010). LBM based flow simulation using GPU computing processor, Computers & Mathematics with Applications, 59(7), 2380-2392.
[80] Asouti, V. G., Trompoukis, X. S., Kampolis, I. C. & Giannakoglou, K. C. (2011). Unsteady CFD computations using vertex-centered finite volumes for unstructured grids on graphics processing units, International Journal for Numerical Methods in Fluids, 67(2), 232-246.
[81] Thibault, J. C. & Senocak, I. (2009), CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows (AIAA Paper 2009-758), 47th AIAA Aerospace Sciences Meeting and exhibit. Orlando, USA: January 5-8.
[82] Antoniou, A. S., Karantasis, K. I., …. Ekaterinaris, J. A. (2010). Acceleration of a finite-difference WENO scheme for large-scale simulations on many-core architectures (AIAA Paper 2010-525), 48th AIAA Aerospace Sciences Meeting and exhibit. Orlando, USA: January 4-7.
85
[83] Castonguay, P., Williams, D., Vincent, P., Lopez, M. & Jameson, A. (2011). On the development of a high-order, multi-GPU enabled, compressible viscous flow solver for mixed unstructured grids, 20th AIAA Computational Fluid Dynamics Conference, Honolulu, Hawaii: June 27-30.
[84] Url-1 <http://www.nvidia.com/docs/IO/56483/Tesla_C1060_boardSpec_ v03.pdf>, date retrieved 02.11.2012.
[85] Tutkun, B. & Edis, F. O. (2012). A GPU application for high-order compact finite difference scheme, Computers & Fluids, 55, 29-35.
[86] Goddeke, D. & Strzodka, R. (2011). Cyclic reduction tridiagonal solvers on GPUs applied to mixed-precision multigrid, Ieee Transactions on Parallel and Distributed Systems, 22(1), 22-32.
[87] Zhang, Y., Cohen, J. & Owens, J. D. (2010). Fast tridiagonal solvers on the GPU, Ppopp 2010. Proceedings of the 2010 Acm Sigplan Symposium on Principles and Practice of Parallel Programming, (pp. 127-136). Bangalore, India: January 9-14.
[88] Url-2 <http://www.serc.iisc.ernet.in/facilities/ComputingFacilities/systems/ fermi/CUDA_C_Best_Practices_Guide.pdf>, date retrieved 02.11.2012.
[89] Url-3 <http://www.scribd.com/doc/49843063/CUDA-C-Programming-Guide>, date retrieved 02.11.2012.
[90] Lesieur, M., Staquet, C., Leroy, P. & Comte, P. (1988). The mixing layer and its coherence examined from the point of view of two-dimensional turbulence, Journal of Fluid Mechanics, 192, 511-534.
[91] Lesieur, M., Comte, P., Chollet, J. P. & Le Roy, P. (1987). Numerical simulation of coherent structures in free shear flows, NATO Advanced Research Workshop on Mathematical Modeling in Combustion and Related Topics, Lyon, France: April 27-30.
[92] Ozel, A. (2006). Computational fluid dynamics analysis of turbulent flows (Master's thesis). Istanbul Technical University, Graduate School of Science Engineering and Technology, İstanbul.
[93] Williamson, C. H. K. (1996). Vortex dynamics in the cylinder wake, Annual Review of Fluid Mechanics, 28, 477-539.
[94] Okajima, A. (1982). Strouhal numbers of rectangular cylinders, Journal of Fluid Mechanics, 123, 379-398.
[95] Kelkar, K. M. & Patankar, S. V. (1992). Numerical prediction of vortex shedding behind a square cylinder, International Journal for Numerical Methods in Fluids, 14(3), 327-341.
[96] Sohankar, A., Norberg, C. & Davidson, L. (1998). Low-Reynolds-number flow around a square cylinder at incidence: Study of blockage, onset of vortex shedding and outlet boundary condition, International Journal for Numerical Methods in Fluids, 26(1), 39-56.
86
[97] Saha, A. K., Biswas, G. & Muralidhar, K. (2003). Three-dimensional study of flow past a square cylinder at low Reynolds numbers, International Journal of Heat and Fluid Flow, 24(1), 54-66.
[98] Robichaux, J., Balachandar, S. & Vanka, S. P. (1999). Three-dimensional Floquet instability of the wake of square cylinder, Physics of Fluids, 11(3), 560-578.
[99] Sharma, A. & Eswaran, V. (2004). Heat and fluid flow across a square cylinder in the two-dimensional laminar flow regime, Numerical Heat Transfer Part A-Applications, 45(3), 247-269.
[100] Sahu, A. K., Chhabra, R. P. & Eswaran, V. (2009). Two-dimensional unsteady laminar flow of a power law fluid across a square cylinder, Journal of Non-Newtonian Fluid Mechanics, 160(2-3), 157-167.
[101] Singh, A. P., De, A. K., Carpenter, V. K., Eswaran, V. & Muralidhar, K. (2009). Flow past a transversely oscillating square cylinder in free stream at low Reynolds numbers, International Journal for Numerical Methods in Fluids, 61(6), 658-682.
[102] Sen, S., Mittal, S. & Biswas, G. (2011). Flow past a square cylinder at low Reynolds numbers, International Journal for Numerical Methods in Fluids, 67(9), 1160-1174.
[103] Perry, A. E., Chong, M. S. & Lim, T. T. (1982). The Vortex-Shedding Process Behind Two-Dimensional Bluff-Bodies, Journal of Fluid Mechanics, 116, 77-90.
[104] Yoon, D. H., Yang, K. S. & Choi, C. B. (2010). Flow past a square cylinder with an angle of incidence, Physics of Fluids, 22(4).
[105] Sohankar, A., Norberg, C. & Davidson, L. (1999). Simulation of three-dimensional flow around a square cylinder at moderate Reynolds numbers, Physics of Fluids, 11(2), 288-306.
[106] Franke, R., Rodi, W. & Schonung, B. (1990). Numerical-calculation of laminar vortex-shedding flow past cylinders, Journal of Wind Engineering and Industrial Aerodynamics, 35(1-3), 237-257.
[107] Davis, R. W., Moore, E. F. & Purtell, L. P. (1984). A Numerical-experimental study of confined flow around rectangular cylinders, Physics of Fluids, 27(1), 46-59.
[108] Url-4 <http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/ docs/ CUSPARSE_Library.pdf>, date retrieved 02.11.2012.
[109] Url-5 <http://code.google.com/p/cusp-library/>, date retrieved 02.11.2012.
87
APPENDICES
APPENDIX A: Fourier Error Analysis of Some Central Finite Difference
Schemes for First and Second Derivatives
In this appendix, Fourier error analyses of some central finite difference schemes are
presented below for the first and second derivatives. For the first derivative:
Fourth-order Pade scheme:
HALM} + 4HA} + HANM} = 3$HANM − HALM∆� ( (A.1)
C0HANJ = F3∗�@JD∆0�@D0w = F3∗�@JD∆0HA (A.2)
F3∗HA~�L@D∆0 + 4 + �@D∆0� = 3∆� ~�@D∆0 − �L@D∆0�HA (A.3)
�@D∆0 = cosZ3∆�\ + F|FGZ3∆�\ (A.4)
3∗∆� = 3|FGZ3∆�\2 + W�|Z3∆�\ (A.5)
Eight-order pentadiagonal compact scheme:
HAL&} + 16HALM} + 36HA} + 16HANM} + HAN&} = 1603 $HANM − HALM2∆� ( + 503 $HAN& − HAL&4∆� ( (A.6)
F3∗HA~�L@&D∆0 + 16�L@D∆0 + 36 + 16�@D∆0 + �@&D∆0�= �1603 $�@D∆0 − �L@D∆02∆� ( + 503 $�
@&D∆0 − �L@&D∆04∆� (�HA (A.7)
3∗∆� = 320|FGZ3∆�\ + 50sinZ23∆�\216 + 12W�|Z23∆�\ + 192cosZ3∆�\ (A.8)
88
Tenth-order pentadiagonal compact scheme:
HAL&} + 10HALM} + 20HA} + 10HANM} + HAN&}= 853 $HANM − HALM2∆� ( + 20215 $HAN& − HAL&4∆� ( + 15$HANe − HALe6∆� ( (A.9)
F3∗HA~�L@&D∆0 + 10�L@D∆0 + 20 + 10�@D∆0 + �@&D∆0�= �853 $�
@D∆0 − �L@D∆02∆� ( + 20215 $�@&D∆0 − �L@&D∆04∆� (
+ 15$�@eD∆0 − �L@eD∆06∆� (� HA
(A.10)
3∗∆� = 850|FGZ3∆�\ + 202sinZ23∆�\ + 2sinZ33∆�\600 + 60W�|Z23∆�\ + 600cosZ3∆�\ (A.11)
And for the second derivative:
Fourth-order Pade scheme:
HALM}} + 10HA}} + HANM}} = 12$HANM − 2HA + HALM∆�& ( (A.12)
−3∗&~�L@D∆0 + 10 + �@D∆0� = 12∆�& ~�@D∆0 − 2 + �L@D∆0� (A.13)
3∗&∆�& = 12 − 12W�|Z3∆�\5 + W�|Z3∆�\ (A.14)
Eight-order pentadiagonal compact scheme:
HAL&}} + 68823 HALM}} + 235823 HA}} + 68823 HANM}} + HAN&}}= 192023 $HANM − 2HA + HALM∆�& ( + 186023 $HAN& − 2HA + HAL&4∆�& (
(A.15)
89
−3∗& P�L@&D∆0 + 68823 �L@D∆0 + 235823 + 68823 �@D∆0 + �@&D∆0Q= 192023 $�@D∆0 − 2 + �L@D∆0∆�& (+ 186023 $�@&D∆0 − 2 + �L@&D∆04∆�& (
(A.16)
3∗&∆�& = 6384092 9~2 − 2W�|Z3∆�\� + 693092 9 Z2 − 2cosZ23∆�\\6117923 9 + W�|Z23∆�\ + Z68823 \cosZ3∆�\ (A.17)
Tenth-order pentadiagonal compact scheme:
HAL&}} + 66843 HALM}} + 179843 HA}} + 66843 HANM}} + HAN&}}= 106543 $HANM − 2HA + HALM∆�& ( + 207643 $HAN& − 2HA + HAL&4∆�& (+ 7943$HANe − 2HA + HALe9∆�& (
(A.15)
−3∗& P�L@&D∆0 + 66843 �L@D∆0 + 179843 + 66843 �@D∆0 + �@&D∆0Q= 106543 $�@D∆0 − 2 + �L@D∆0∆�& (+ 207643 $�@&D∆0 − 2 + �L@&D∆04∆�& (+ 7943$�
@eD∆0 − 2 + �L@eD∆09∆�& (
(A.16)
3∗&∆�&= ~½�r�er¾ �~1 − W�|Z3∆�\� + ~np¾Mer¾ �Z1 − cosZ23∆�\\ + ~ ¾½er¾�Z1 − cosZ33∆�\\r½½ne + W�|Z23∆�\ + pprne cosZ3∆�\ (A.17)
91
CURRICULUM VITAE
Name Surname: Bülent Tutkun
Place and Date of Birth: Istanbul / 1974
Address: Istanbul Teknik Üniversitesi, Uçak ve Uzay Bil. Fakültesi
E-Mail: [email protected]
B.Sc.: Astronautical Engineering
M.Sc.: Astronautical Engineering
Professional Experience and Rewards:
Research assistant in ITU, Faculty of Aeronautics and Astronautics since 2001.
List of Publications and Patents:
� Kurtulus, C., Baltaci, T., Ulusoy, M., Aydin, B. T., Tutkun, B . et al. (2007). ITU-pSAT I: Istanbul Technical University Student Pico-Satellite Program, 3rd International Conference on Recent Advances in Space Technologies, Istanbul, Turkey: June 14-16.
� Tutkun, B. & Meric, R. A. (2002). Comparison of turbulence models for transonic flow around an airfoil, Kayseri 4th National Aviation Symposium, Kayseri, Erciyes University: May 13-15.
� Tutkun, B. & Meric, R. A. (2001). Analysis of thermal flows around an airfoil in the canal with finite volume method and grid optimization, 12th National Mechanics Congress, Konya, Selcuk University: September 10-14.
PUBLICATIONS/PRESENTATIONS ON THE THESIS
� Tutkun B. & Edis F.O. (2012). A GPU application for high-order compact finite difference scheme, Computers & Fluids, 55, 29-35.
� Tutkun B. & Edis F.O. (2012). A GPU implementation for the direct-forcing immersed boundary method, 24th International Conference on Parallel Computational Fluid Dynamics, Atlanta, USA: May 21-25.