
Throughput performance evaluation of the Intel® SSD DC P3700 for NVMe on the SGI® UV™ 300

Nikos Trikoupis, Paul Muzio

CUNY High Performance Computing Center, City University of New York

{nikolaos.trikoupis, paul.muzio@csi.cuny.edu}

Abstract
We focus on measuring the aggregate throughput delivered by 12 Intel® SSD DC P3700 for NVMe cards installed on the SGI UV 300 scale-up system in the City University of New York (CUNY) High Performance Computing Center (HPCC). We establish a performance baseline for a single SSD. The 12 SSDs are assembled into a single RAID-0 volume using Linux Software RAID and the XVM Volume Manager. The aggregate read and write throughput is measured against different configurations that include the XFS and the GPFS file systems. We show that for some configurations throughput scales almost linearly when compared to the single DC P3700 baseline.

1. Introduction
Flash storage is enjoying growing popularity. Typically, PCIe NVMe Flash storage is used to bridge the latency gap between DRAM-based main memory and disk-based non-volatile storage, particularly for applications with random access patterns. Here, the random IOPS performance of SSDs is generally the characteristic attracting the most attention. This is understandable, since designing a file system with high IOPS performance is a major cost driver. There are, however, still many workloads in high-performance computing (HPC) that can benefit from the improved sequential access throughput provided by SSDs.
For sequential reads and writes, a single Intel P3700 2 TB add-in card can deliver 2.8 GB/s for reads and 2 GB/s for writes [1]. As SGI demonstrated at the Supercomputing 2014 conference [2], the UV 300 scale-up x86 system is capable of hosting up to 64 Intel P3700 cards with an aggregate performance of 180 GB/s of sequential throughput. Other than SGI's Supercomputing announcements, to the best of our knowledge there have been no other benchmarks demonstrating this kind of scaling on a single x86 Linux-based system using different volume managers or file systems.

2. Motivation and Background
Transferring large data sets in and out of a server node with a 12 TB DRAM configuration is part of a Proof of Concept (POC) project at the CUNY HPCC investigating the potential performance benefits of PCIe NVMe Flash storage for kdb+ by Kx Systems [3], a popular commercial database used globally for high-performance, data-intensive applications. The datasets the database typically operates against can grow anywhere from a few terabytes to multiple petabytes. The majority of the workloads are highly sequential, read-intensive I/O. The results described on the following pages are the initial results from this POC project.


2.1. Hardware Architecture
The CUNY HPCC SGI UV 300 is named "APPEL", in honor of Dr. Kenneth Appel, an alumnus of Queens College/CUNY, known for his work in topology and, in particular, for proving the four color map theorem with Wolfgang Haken in 1976. APPEL is a multiprocessor distributed shared memory (DSM) system with 32 Intel Xeon E7-8857 v2 3.00 GHz processors (384 Ivy Bridge CPU cores), 12 TB of DDR3 memory, 12 NVIDIA K20m GPGPUs, and 2 Intel Xeon Phi KNC co-processors. A single Linux kernel manages all devices and shares the memory of the system. The operating system is a standard SuSE Linux Enterprise Server 11 distribution.
"APPEL" is a Cache Coherent Non-Uniform Memory Architecture (ccNUMA) system. Memory is physically located at various distances from the processors. As a result, memory access times, or latencies, are different, or non-uniform. For example, in the diagram below, it takes less time for a processor located in the left unit to reference its locally installed memory than to reference remote memory in the right unit. SGI's NUMAlink v7 interconnect lets all of the processors share the single 12 TB memory space with a guaranteed latency of less than 500 ns for any memory reference from any processor [4].

Diagram 1: APPEL's system architecture
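For illustration, the non-uniform node distances described above can be inspected on the running system; a minimal check, assuming the numactl package is installed, is:
# numactl --hardware    # lists the NUMA nodes, their memory sizes, and the inter-node distance matrix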

A single-rack SGI UV 300 comes with a total of 96 PCIe 3.0 slots. For the purposes of this testing, we distributed twelve 2 TB Intel P3700 SSD cards in the available x16 slots across the system units. It should be noted, however, that the P3700 only requires x4 width, allowing for the addition, if required, of many more P3700s (or other devices) in the UV 300 configuration. The layout is shown in Diagram 2:

Diagram 2: APPEL's PCIe layout
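Since the P3700 negotiates only a x4 link even when seated in a x16 slot, the link width each card trained at can be checked with lspci. A sketch (the bus address shown is illustrative, not one from our system):
# lspci | grep -i 'Non-Volatile memory controller'   # list the NVMe controllers and their bus addresses
# lspci -vv -s 05:00.0 | grep LnkSta                 # e.g. "LnkSta: Speed 8GT/s, Width x4" for a PCIe 3.0 x4 link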


2.2. Drivers and Software
NVM Express (NVMe) driver. The Intel P3700 is the first generation of Intel SSDs based on the innovative NVMe protocol. The traditional Linux block layer had already become a bottleneck to storage performance as the advent of NAND Flash storage allowed systems to reach 800,000 IOPS [5]. NVMe allows for increased parallelism by providing multiple I/O dispatch queues, distributed so that there is a local queue per NUMA node or processor. The Intel P3700 cards can support up to 31 NVMe I/O queues and one admin queue. SGI modified the NVMe driver to optimize it for its UV systems so that the queues are distributed evenly across the multiple CPU sockets in the system. Each process may now have its own I/O submission and completion queue and an interrupt for each queue to the storage device, eliminating the need for remote memory accesses [6][7]. The versions of the driver are: sgi-nvme-kmp-default-1.0.0_3.0.76_0.11-sgi713a1.sles11sp3.x86_64.rpm and sgi-nvme-1.0.0-sgi713a1.sles11sp3.x86_64.rpm.
Linux MD driver with Intel's RSTe extensions. We use the Multiple Device driver, known as Linux Software RAID, with Intel RSTe (IMSM metadata container) to create a striped RAID-0 device from all twelve SSDs [8]. Besides its own formats for RAID volume metadata, Linux Software RAID also supports external metadata formats, such as Intel's Rapid Storage Technology Enterprise (RSTe) extensions (IMSM metadata container). The latest version of the mdadm userspace utility was downloaded from http://git.neil.brown.name/?p=mdadm.git.
XVM Volume Manager. SGI's XVM is a NUMA-aware volume manager optimized for the UV 300 system. Similar to Linux MD-RAID, we use XVM to create a logical volume that is striped across all twelve SSDs.
XFS file system. XFS is a popular high-performance journaling file system, which is capable of handling file systems as large as 9 million terabytes [9], and is used at our HPC center.
GPFS file system. The General Parallel File System, recently rebranded as Spectrum Scale, and also in use at the CUNY HPC Center, is a high-performance parallel file system developed by IBM. Rather than relying on striping in a separate volume manager layer, GPFS implements striping at the file system level [10].
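The installed driver packages and the interrupt fan-out they create can be sanity-checked from userspace; a minimal sketch, using the same /proc path as the affinity script in Section 3.1:
# rpm -qa | grep sgi-nvme            # confirm the SGI NVMe driver packages listed above are installed
# find /proc/irq/*/nvme* | wc -l     # total number of NVMe interrupt vectors registered across the twelve cards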

3. Methodology
Because the database in our POC project, Kx Systems' kdb+, demonstrates highly sequential read I/O for historic and real-time data, we optimize and measure the throughput performance of the Intel P3700 SSDs under a variety of volume manager and file system configurations. For this purpose, we use the well-known IOR benchmark [11] developed by Lawrence Livermore National Laboratory.
IOR is an MPI-coordinated benchmark. We compile and execute it using the Message Passing Toolkit (MPT), which is SGI's MPI implementation. The most significant IOR options used in our tests are:

• -a POSIX: use the POSIX API for I/O.
• -B: use direct I/O, bypassing buffers.
• -t transferSize: size in bytes of a single I/O transaction transferring data from memory to the data file.
• -s segmentCount: controls the total size of the data file.
• -N numTasks: number of MPI threads participating in the test.
• -F filePerProc: let each thread write to its own data file.
• -o directories: distribute the data files evenly over the specified directories.
Before all the tests, all SSD cards were securely erased using blkdiscard -v /dev/${DEV}. Each test was run twice. The first set of numbers was discarded in an effort to approach the SSDs' steady state. The second set of numbers was recorded and is presented here.
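For reference, the erase step can be applied to all twelve cards with a short loop; a minimal sketch, assuming the /dev/nvme0n1 through /dev/nvme11n1 device names used throughout this report:
# for i in $(seq 0 11); do blkdiscard -v /dev/nvme${i}n1; done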


Direct I/O (the O_DIRECT flag) is commonly used with SSDs to boost bulk I/O operations by allowing applications to write directly to the storage device, bypassing the kernel page cache. The use of the flag in production systems is complicated, because it requires asynchronous writes with a large queue depth, and it is also controversial [12]. However, considering the 12 TB of memory available in our system, we made a conscious decision to use it during these tests to keep the throughput results from being obscured by caching effects.
To establish a baseline, we first do a number of IOR test runs with a varying number of IOR threads against a single P3700 formatted with XFS and mounted with the noatime and nobarrier options to remove some unnecessary file system overhead. Common practice in measuring throughput is to use large transfer sizes, typically starting from 1 MB and going toward 4, 8, or 16 MB. We decided that our baseline would be the best read and write throughput observed for a transfer size of 1 MB, and these are the numbers we record.
Next, we format each of the twelve installed P3700 SSDs with XFS and mount them on the system. We complete a series of read and write tests using transfer sizes of 1, 4, 8 and 16 MB. We repeat the tests using 33, 66, 99, 132, and 264 threads to find the minimum number of threads required to saturate the SSDs.
Using the P3700 cards as individual file systems, however, has generally limited value. It would be more interesting to assemble them into a RAID device and aggregate their storage capacity under one file system. For this, we use two different methods to create the single volume. One is SGI's XVM Volume Manager. The other is Linux Software RAID (MD-RAID) with the Intel Rapid Storage Technology Enterprise extensions (IMSM metadata container). In both cases we create a striped, RAID-0 volume, without being concerned about redundancy or mirroring. We format the volume with the XFS file system. As before, we complete a series of read and write tests with transfer sizes of 1 to 16 MB and we repeat using 33, 66, 99, 132, and 264 threads.
Finally, we format a single GPFS file system, configuring each of the raw /dev/nvme* devices as NSD drives. An important consideration when creating a GPFS file system is the selection of the appropriate block size, a variable set at format time. In the GPFS version 3.5.0.26 that we use for this test, the allowed block sizes are between 64 KB and 8 MB. When optimizing a GPFS file system for throughput, larger block sizes are preferred, although this tends to waste storage space, particularly when small files are stored on the file system. We repeated the same collection of tests as before for two GPFS instances, one with a block size of 1 MB and one with 4 MB.
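The thread-count and transfer-size sweep described above can be driven by a small wrapper around the IOR command lines shown in the test sections that follow. A minimal sketch, mirroring those example command lines (the segment count would in practice be adjusted per transfer size to keep the data set a sensible size):
$ for threads in 33 66 99 132 264; do \
    for ts in 1m 4m 8m 16m; do \
      mpiexec_mpt /scratch/nikos/IOR/src/C/IOR -a POSIX -B -e \
        -t $ts -b $ts -s 10000 -N $threads -C -F -k; \
    done; \
  done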

3.1 System Tuning
NVMe queue distribution. The SGI-modified NVMe driver calculates the optimal queue distribution across all available processor sockets, so that each socket gets at least one queue, and generates CPU interrupt affinity hints [6]. SGI's IRQ balancer, sgi_irqbalance, currently has no awareness of this driver, so it was disabled. After rebooting the system, the following script was used to set the affinity in accordance with the driver hints for all Intel NVMe devices in the system:
# find /proc/irq/*/nvme* | cut -d/ -f4 | xargs -I '{}' \
    sh -c "cat /proc/irq/{}/affinity_hint > /proc/irq/{}/smp_affinity"
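A quick way to confirm that the hints were applied is to read back the same /proc entries the script writes; a sketch:
# for irq in $(find /proc/irq/*/nvme* | cut -d/ -f4); do \
    echo "IRQ $irq: hint=$(cat /proc/irq/$irq/affinity_hint) set=$(cat /proc/irq/$irq/smp_affinity)"; done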

C1E power settings. To prevent inactive sockets entering power-saving mode and impacting measured performance, we disabled the C1E state on all CPUs on the system using the following script:
# for p in $(seq $(sed 's/-/ /' /sys/devices/system/cpu/online)); \


    do wrmsr -p $p 0x1fc 0x35040041; done
Hyperthreading is disabled on this UV 300 system.
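The MSR write can be read back with rdmsr from the same msr-tools package; a sketch (the msr kernel module must be loaded):
# modprobe msr
# for p in $(seq $(sed 's/-/ /' /sys/devices/system/cpu/online)); do rdmsr -p $p 0x1fc; done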

4. Test results
The configuration of the tests and their results are presented below.

4.1 Test 1: Baseline - One Intel P3700 card formatted with XFS
Configuration
A single Intel P3700 card is formatted and mounted as follows:
# mkfs.xfs -f -K -d su=128k,sw=1 /dev/nvme0n1
# mount -t xfs /dev/nvme0n1 /p3700/0
The command line used is:
$ /scratch/nikos/IOR/src/C/IOR -a POSIX -B -e -t 1m -b 1m -s 10000 -N 6 -C -F
Results
• Write result: 2.05 GB/s, achieved using 6 threads and an IOR transfer size of 1 MB.
• Read result: 2.78 GB/s, achieved using 6 threads and an IOR transfer size of 1 MB.
During this test, iostat reports close to 100% utilization for the /dev/nvme* device. However, it should be stressed that IOR is a synthetic, best-case-scenario benchmark; real-world application performance may be less.
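The utilization figures quoted here and in the following tests come from iostat (sysstat package) run in a second shell during each IOR pass; a typical invocation is:
$ iostat -xm 2 | grep -E 'Device|nvme'   # extended stats every 2 s; watch the rMB/s, wMB/s and %util columns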

4.2 Test 2: Twelve Intel P3700 cards, one XFS file system per card
Configuration
12 Intel P3700 cards are formatted as in Test 1 and mounted under 12 separate mount points below /p3700:
# mount -t xfs /dev/nvme0n1 /p3700/0
...
# mount -t xfs /dev/nvme11n1 /p3700/11
IOR is run using 33 to 264 threads. We test with IOR transfer sizes from 1 MB to 16 MB. During each run, IOR writes on all Intel cards simultaneously. Example IOR command line used:
$ mpiexec_mpt /scratch/nikos/IOR/src/C/IOR -a POSIX -B -e -t 1m -b 1m \
    -s 10000 -N 132 -C -F -k -o 0/0@1/1@2/2@3/3@4/4@5/5@6/6@7/7@8/8@9/9@10/10@11/11
Results
• Best write result: 24.1 GB/s, using at least 132 threads and a transfer size of 1 MB.
• Best read result: 32.9 GB/s, using at least 132 threads and a transfer size of 1 MB.
During the test, iostat reported close to 100% utilization for the NVMe devices.
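For reference, the twelve per-card file systems in the configuration above can be created and mounted with a short loop; a sketch using the same device and directory naming:
# for i in $(seq 0 11); do mkdir -p /p3700/$i && mount -t xfs /dev/nvme${i}n1 /p3700/$i; done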


Test 2. With 33 IOR threads, 1 MB reads show 81% scalability and 1 MB writes are at 90% scalability. With 132 IOR threads, the comparable percentages are each close to 100%. This is because there is no additional overhead on the P3700 cards over the single-card, single-file-system setup. This is important in the context of our POC project, since our application can be configured with multiple I/O processes, each using a separate physical directory for its own dataset. This setup gave the best overall performance among all our tests. Note that the results of all four write tests are shown in the chart and overlap so closely in performance that the distinct lines are not easily visible.

4.3 Test 3: Twelve Intel P3700 cards, one GPFS file system with 1M block size
Configuration
For this test, GPFS version 3.5.0.26 is set up as a single file system with each of the raw /dev/nvme* devices as NSD drives. It is created using a 1M block size and mounted as follows:
# mmcrfs p3700 -F p3700.lst -B 1M -v no -n 32 -j scatter -T /global/p3700 -A no
# mmmount /global/p3700 -o dio,nomtime,noatime
Tests are run inside the NSD server, the UV 300 itself. No remote clients are involved in any test. IOR is run using 33 to 264 threads with IOR transfer sizes from 1 MB to 16 MB. Example IOR command line used:
$ mpiexec_mpt /scratch/nikos/IOR/src/C/IOR -a POSIX -B -e -t 1m -b 1m \
    -s 1000 -N 33 -C -F -k
Results
• Best read result: 31.6 GB/s, with 66 threads and a transfer size of 16 MB, although very similar numbers were observed using a 1 MB transfer size.
• Best write result: 18.8 GB/s, using 33 threads and a transfer size of 16 MB.
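The p3700.lst file passed to mmcrfs above is a disk descriptor file that we do not reproduce here. Purely as an illustration, an NSD stanza for one of the cards, in the stanza format accepted by GPFS 3.5, might look like the following (the NSD name is hypothetical):
%nsd:
  device=/dev/nvme0n1
  nsd=nsd_p3700_0
  usage=dataAndMetadata
  failureGroup=-1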

Test 3. Although read performance using GPFS was comparable to the results from Test 2 using XFS, with 95% scalability at 66 IOR threads, write performance was decidedly not, with only 31% scalability. We expect that newer versions of GPFS with tuning for NVMe drives will show better performance on writes.


4.4 Test 4: Twelve Intel DC P3700 cards, one GPFS file system with 4M block size
Configuration
GPFS version 3.5.0.26 is set up with each of the raw /dev/nvme* devices as NSD drives. It is created using a 4M block size and mounted as follows:
# mmcrfs p3700 -F p3700.lst -B 4M -v no -n 32 -j scatter -T /global/p3700 -A no
# mmmount /global/p3700 -o dio,nomtime,noatime
No remote clients were involved. IOR is run using 33 to 264 threads and tested with IOR transfer sizes from 1 MB to 16 MB. Example command line used:
$ mpiexec_mpt /scratch/nikos/IOR/src/C/IOR -a POSIX -B -e -t 1m -b 1m -s 1000 -N 66 -C -F -k
Results
• Best read result: 31.3 GB/s, using 99 threads and an IOR transfer size of 16 MB (although very similar numbers were achieved using a 1 MB transfer size).
• Best write result: 22.4 GB/s, using 66 threads and an IOR transfer size of 16 MB.

Test 4. Formatting with a larger, 4 MB GPFS block size improved performance for large sequential writes compared to Test 3, but results still lagged compared to Test 2 with XFS. Again, we expect that newer versions of GPFS with tuning for NVMe drives will show better performance on writes. Read performance remained excellent for a single file system spread over 12 NVMe cards.

4.5 Test 5a: Twelve Intel P3700 cards, MD-RAID, one XFS file system
Configuration
In this test, all /dev/nvme* devices are assembled into a RAID-0 array using Linux Software RAID (MD-RAID). On top of the array, an XFS file system is laid out, using the default formatting options:
# mdadm --create /dev/md0 --chunk=128 --level=0 --raid-devices=12 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1
# mkfs.xfs -K /dev/md0
# mount -o noatime,nodiratime,nobarrier /dev/md0 /scratch_ssd/
IOR is run using 33 to 264 threads, and on each round we test with IOR transfer sizes from 1 MB to 16 MB. We use IOR with the POSIX API, one file per process, doing sequential writes. Example command line:
$ mpiexec_mpt /scratch/nikos/IOR/src/C/IOR -a POSIX -B -e -t 1m -b 1m -s 10000 -N 66 -C -F -k
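Before running IOR, one can confirm that the array assembled cleanly across all twelve devices; a quick sketch:
# cat /proc/mdstat            # should list md0 as an active raid0 with 12 nvme members
# mdadm --detail /dev/md0     # reports the chunk size (128K) and the total array size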


Results
• Best read result: 29.8 GB/s, using 66 threads and a transfer size of 4 MB.
• Best write result: 23.9 GB/s, using 66 threads and a transfer size of 8 MB.
During the test, iostat reports close to 100% utilization for all NVMe devices.

Test 5a. This setup, using one XFS file system over MD-RAID, gave excellent performance and the most consistent results among the three prior scenarios.

4.6 Test 5b: Twelve Intel P3700 cards, MD-RAID with RSTe, one XFS file system
Configuration
This test is very similar to Test 5a, with the difference that we create the volume using the Intel Rapid Storage Technology Enterprise (RSTe) RAID metadata format, which enables Intel RSTe features [13], such as creating multiple volumes within the same array. Also, instead of formatting the raw /dev/md0 device, we partition it first. We use the following commands to set up the array, partition, format and mount it:
# mdadm -C /dev/md/imsm /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 -n 12 -e imsm -f
# mdadm -C /dev/md0 /dev/md/imsm -n 12 -l 0 -f
# parted /dev/md0 mklabel gpt
# parted /dev/md0 mkpart primary 2097152B 100%
# parted /dev/md0 align-check opt 1
# mkfs.xfs -f -K /dev/md0p1
# mount -o noatime,nodiratime,nobarrier /dev/md0p1 /scratch_ssd/
As before, IOR is run using 33 to 264 threads, and on each round we test with IOR transfer sizes between 1 MB and 16 MB.
Results
• Best read result: 29.9 GB/s, using 66 threads and a transfer size of 4 MB.
• Best write result: 24.2 GB/s, using 66 threads and a transfer size of 8 MB.
During the test, iostat reports close to 100% utilization for all NVMe devices.
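For completeness, the IMSM container and the RAID-0 volume built inside it can be inspected with mdadm; a sketch:
# mdadm --detail-platform     # the platform's RSTe/IMSM capabilities, if any are detected
# mdadm -E /dev/nvme0n1       # the IMSM container metadata as recorded on a member device
# mdadm --detail /dev/md0     # the RAID-0 volume created inside the container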


Test 5b. This setup, using one XFS file system over MD-RAID/IMSM, gave excellent performance, marginally better than that of plain MD-RAID, but the performance was slightly less consistent across the number of IOR threads.

4.7 Test 6: Twelve Intel DC P3700 cards, XVM volume manager, one XFS file system
Configuration
For this test we install the latest XVM 3.4 binaries from SGI. All /dev/nvme* devices are assembled into a basic striped XVM volume using SGI's xvmgr. On top of the volume, an XFS file system is laid out, using the default formatting options. IOR is run using 33 to 264 threads, and on each round we test with IOR transfer sizes between 1 MB and 16 MB.
Results
• Best read result: 28.9 GB/s, using 66 threads and a transfer size of 16 MB.
• Best write result: 23.8 GB/s, using 66 threads and a transfer size of 8 MB.

Test 6. The results for this setup were similar to Tests 5a and 5b (XFS over MD-RAID), with the latter two giving marginally better and more consistent performance.


6. Conclusion
We evaluated the sequential throughput performance of a collection of 12 Intel P3700 NVMe Flash cards installed on a single SGI UV 300 system. Although the UV 300 can support many more SSD cards, our system has only 12 installed. These cards are typically installed in HPC environments aiming for high IOPS. We found that they offer excellent throughput performance, slightly better than the published specifications, for sequential reads and writes. We tested the DC P3700 SSDs using different file systems and volume managers. Throughput displayed excellent scaling compared to the single-card baseline numbers, with very little overhead.
The configuration that gave the best overall performance among all our tests, 32.9 GB/s read and 24.1 GB/s write, is in Test 2, where we do not put a single volume on top of all the SSDs but instead use them formatted and mounted as individual file systems. This is acceptable for our POC project, since our application can be configured with multiple I/O processes, each using a separate physical directory with its own dataset.
Following closely are the results in Test 5b, where we stripe the SSDs using Linux Software RAID with Intel's IMSM extensions, achieving a throughput of 29.9 GB/s read and 24.2 GB/s write. Although write throughput is essentially the same as in the best-case scenario where the SSDs are individually formatted and mounted, there seems to be a 9.1% tax in read throughput. We believe the most probable reason for this is that the /dev/md0 virtual device in MD-RAID still suffers from the classic architectural limitation in the Linux block layer of a single submission/completion queue per block device, even if the underlying NVMe devices support the new block-multiqueue architecture.

7. Acknowledgements
The authors want to thank the following people for their support and their help in this project. From Intel: Chris Allison, Melanie Fekete, Andrey Kudryavtsev, Cyndi Peach. From Silicon Graphics International: James Hooks, John Kichury, Kirill Malkin.


8. References
[1] Intel Solid State Drive DC P3700 Series - Product Specifications. http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-p3700-spec.html
[2] "HPC bod SGI racks UV brains, reaches 30 MEEELLION IOPS". http://www.theregister.co.uk/2014/11/17/sgi_uv_reaching_30_million_iops_with_nvme_flashers/
[3] kdb+ database. https://kx.com
[4] SGI® UV™ 300H for SAP HANA. https://www.sgi.com/pdfs/4554.pdf
[5] Bjørling, Axboe, Nellans, Bonnet: Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems. http://kernel.dk/systor13-final18.pdf
[6] Malkin, Patel, Higdon: Deploying Intel DC P3700 Flash on SGI UV Systems - Best Practices for Early Access (SGI Proprietary)
[7] Malkin: Delivering Performance of Modern Storage Hardware to Applications. http://blog.sgi.com/delivering-performance-of-modern-storage-hardware-to-applications
[8] Kudryavtsev, Bybin: Hands-on Lab: How to Unleash Your Storage Performance by Using NVM Express Based PCI Express Solid-State Drives. Intel Developer Forum 2015. http://www.slideshare.net/LarryCover/handson-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solidstate-drives
[9] XFS FAQ. http://xfs.org/index.php/XFS_FAQ
[10] GPFS 3.5 Concepts, Planning, and Installation Guide. http://www-01.ibm.com/support/docview.wss?uid=pub1ga76041305
[11] IOR benchmark. https://github.com/chaos/ior
[12] Torvalds: email forum exchanges on O_DIRECT. https://lkml.org/lkml/2007/1/10/233 and http://yarchive.net/comp/linux/o_direct.html
[13] Intel NVMe SSDs and Intel RSTe for Linux. http://www.intel.com/content/dam/support/us/en/documents/solid-state-drives/Quick_Start_RSTe_NVMe_for%20Linux.pdf