
Lustre User Group 2010 - Aptos, CA

Lustre deployment and early experiences

Florent Parent, Coordinator, Québec site


Outline

• What is CLUMEQ?

• CLUMEQ's new HPC cluster: Colosse

• Lustre experience


What is CLUMEQ?

• Consortium of 11 universities in the province of Québec, Canada

• Part of the Compute Canada national platform

• Two HPC sites:
  ✓ Montréal
  ✓ Québec City


Who is CLUMEQ?


CLUMEQ's mission

• To serve the HPC needs of its member institutions in all fields of research

• To reach out to non-traditional and emerging HPC fields

• To train "highly qualified personnel" (HQP)


Compute Canada / Calcul Canada

A proposal to the Canada Foundation for Innovation – National Platforms Fund

Hugh Couchman (McMaster University, SHARCNET)
Robert Deupree (Saint Mary’s University, ACEnet)
Ken Edgecombe (Queen’s University, HPCVL)
Wagdi Habashi (McGill University, CLUMEQ)
Richard Peltier (University of Toronto, SciNet)
Jonathan Schaeffer (University of Alberta, WestGrid)
David Senechal (Université de Sherbrooke, RQCHP)

Executive Summary

The Compute/Calcul Canada (CC) initiative unites the academic high-performance computing (HPC) organizations in Canada. The seven regional HPC consortia in Canada (ACEnet, CLUMEQ, RQCHP, HPCVL, SciNet, SHARCNET and WestGrid) represent over 50 institutions and over one thousand university faculty members doing computationally-based research. The Compute Canada initiative is a coherent and comprehensive proposal to build a shared distributed HPC infrastructure across Canada to best meet the needs of the research community and enable leading-edge world-competitive research. This proposal is requesting an investment of 60 M$ from CFI (150 M$ with matching money) to put the necessary infrastructure in place for four of the consortia for the 2007-2010 period. It is also requesting operating funds from Canada’s research councils, for all seven consortia. Compute Canada has developed a consensus on national governance, resource planning, and resource sharing models, allowing for effective usage and management of the proposed facilities. Compute Canada represents a major step forward in moving from a regional to a national HPC collaboration. Our vision is the result of extensive consultations with the Canadian research community.


Outline

• What is CLUMEQ?

• CLUMEQ's new HPC cluster: Colosse

• Lustre experience


CLUMEQ Colosse

• Sun Constellation system
  ✓ 10 fully loaded Sun Blade 6048, with X6275 modules (double Nehalem EP blade, 2.8 GHz, 24 GB of RAM)
  ✓ full-bisection IB-QDR interconnect (2x M9 switches)
  ✓ 1 PB of Lustre storage in a high-availability configuration, using 2 MDS and 9x2 OSS
  ✓ Sun J4400 storage arrays

• 86 Tflops peak
  ✓ 77 Tflops max (preliminary run)
  ✓ ---> 80 Tflops?
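As a rough cross-check of the 86 Tflops peak figure, the sketch below multiplies out the node count given on the InfiniBand slide (40 shelves of 24 two-socket nodes) with the 2.8 GHz clock. The 4 cores per socket and 4 double-precision flops per core per cycle are assumptions about the Nehalem EP parts, not numbers stated on the slides.

```python
# Rough peak-performance estimate for Colosse (a sketch, not vendor data).
SHELVES = 40              # from the slides: 40 shelves in 10 racks
NODES_PER_SHELF = 24      # from the slides
SOCKETS_PER_NODE = 2      # from the slides: 2-socket nodes
CORES_PER_SOCKET = 4      # assumption: quad-core Nehalem EP
CLOCK_GHZ = 2.8           # from the slides
FLOPS_PER_CYCLE = 4       # assumption: double-precision flops per core per cycle

nodes = SHELVES * NODES_PER_SHELF
cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0

MEASURED_TFLOPS = 77.0    # from the slides: preliminary run
efficiency = MEASURED_TFLOPS / peak_tflops

print(f"{nodes} nodes, {cores} cores")
print(f"peak = {peak_tflops:.1f} Tflops, efficiency = {efficiency:.1%}")
```

Under these assumptions the product lands on the quoted peak, and the preliminary 77 Tflops run corresponds to just under 90% of it.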


InfiniBand Architecture

[Diagram: InfiniBand fabric. Labels recoverable from the figure:]

• 40 shelves in 10 racks, 960 2-socket nodes
• 24 nodes per shelf, each shelf with a QNEM IB leaf switch
• Two M9 648-port core switches
• Two M2 36-port leaf switches
• Attached through the M2 leaf switches: Lustre OSS HA pairs (groups of 4 and 5), the Lustre HA MDS, and infrastructure nodes (the label "11 infrastructure nodes" appears twice)
• Link counts shown: 12 links, 21 links, 24 links


Second floor contains all compute racks + core networking switches

First floor contains file system & infrastructure nodes

Racks aligned in a circle around a central hot core; outside ring is a cold aisle


Street view...


Satellite view...


Free air cooling system

Main cooling system

Air blowers

cooling coils


Cold air plenum (32 m²)

Hot air core (25 m²)


View inside hot air core


Main specifications

• Rack capacity: 56

• Cooling capacity: ~1.5 MW

• Electrical capacity: 1.1 MW (1.6 MW)

• Blowing capacity: 132,500 CFM

• Maximum air velocity: 2.4 m/s

• Floor loading capacity: 940 lb/ft²
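The specifications mix imperial and SI units. For readers who prefer SI throughout, here is a minimal conversion sketch; the helper names are illustrative only, and the conversion factors are the standard exact definitions of the foot and the pound.

```python
# Convert the imperial figures on the slide to SI units.
FT_IN_M = 0.3048          # exact, by definition
LB_IN_KG = 0.45359237     # exact, by definition

def cfm_to_m3_per_s(cfm: float) -> float:
    """Cubic feet per minute -> cubic metres per second."""
    return cfm * FT_IN_M ** 3 / 60.0

def lb_per_ft2_to_kg_per_m2(load: float) -> float:
    """Pounds per square foot -> kilograms per square metre."""
    return load * LB_IN_KG / FT_IN_M ** 2

airflow = cfm_to_m3_per_s(132_500)          # blowing capacity from the slide
floor_load = lb_per_ft2_to_kg_per_m2(940)   # floor loading from the slide
print(f"airflow    = {airflow:.1f} m^3/s")
print(f"floor load = {floor_load:.0f} kg/m^2")
```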


Outline

• What is CLUMEQ?

• CLUMEQ's new HPC cluster: Colosse

• Lustre experience


Timeline

• New staff hired in Apr 2009 and July 2009

• Nov 12: System acceptance signed

• Dec 16: First “Beta users” on machine

• Nov - now: Learning, debugging, patching, tuning, helping users


Experience so far...

• Many technologies to get up to speed on (installing, monitoring, debugging, tuning ...)
  ✓ Lustre
  ✓ InfiniBand
  ✓ Grid Engine

• So far, pretty much everything is learned as we go
  ✓ asking subject-matter experts (SMEs) questions
  ✓ reading documentation, mailing list discussions


CLUMEQ Lustre deployment

• At acceptance:
  ✓ CentOS 5.3 (2.6.18-128.2.1.el5)
  ✓ OFED 1.4.1, Lustre 1.8.1.1

• Now:
  ✓ CentOS 5.3 (2.6.18-164.11.1.el5_lustre.1.8.2)
  ✓ OFED 1.4.1(?) stock from RH
  ✓ Lustre 1.8.2 + patch


Lustre fixes

• OSS crash during heartbeat/failover
  ✓ “CPU hog/soft lockups” bug (21612, 19557)
  ✓ Patched in 1.8.2

• 1.8.2 installed as soon as it reached GA
  ✓ Started to see high load on MDT, Lustre hanging on clients
  ✓ inode link count fix (22177)
  ✓ Installed patched version


Filesystem structure

• Lustre is the only FS on Colosse
  ✓ /home for user accounts
  ✓ /rap for group-shared space
  ✓ /scratch for temporary data

• Lustre striping
  ✓ /home and /rap use a stripe count of 1 (typically small files)
  ✓ /scratch uses a stripe count of 72 (parallel I/O performance)
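To illustrate why the two stripe counts suit different workloads, the sketch below models round-robin striping of a file over its objects. The 1 MiB stripe size and the two example file sizes are assumptions chosen for illustration; the slides give only the stripe counts.

```python
# Sketch of round-robin (RAID-0 style) striping of a file over its objects.
STRIPE_SIZE = 1 << 20     # assumption: 1 MiB stripe size

def stripe_index(offset: int, stripe_count: int, stripe_size: int = STRIPE_SIZE) -> int:
    """Return which of the file's stripe_count objects holds the byte at offset."""
    return (offset // stripe_size) % stripe_count

def objects_touched(length: int, stripe_count: int, stripe_size: int = STRIPE_SIZE) -> int:
    """Number of distinct objects that actually hold data for a file of `length` bytes."""
    chunks = -(-length // stripe_size)    # ceiling division
    return min(chunks, stripe_count)

small = 64 * 1024            # hypothetical 64 KiB file, as might live in /home
large = 10 * (1 << 30)       # hypothetical 10 GiB file on /scratch

print(objects_touched(small, 1), objects_touched(small, 72))
print(objects_touched(large, 1), objects_touched(large, 72))
```

A small file keeps its data in a single object whatever the stripe count, so a wide stripe brings it no bandwidth benefit, whereas a large file on /scratch can be served by all 72 OSTs at once. On a live system a directory's default stripe count is typically set with `lfs setstripe -c <count> <dir>`; the exact options should be checked against the installed Lustre version.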


OSS HA and RAID setup

[Diagram: four Sun Fire X4250 OSS servers (OSS1 to OSS4) at rack positions 1 to 42; attached disks labelled a through cs, grouped into software RAID devices md10 to md15, md20 to md25, md30 to md35 and md40 to md45]

• OSS HA pairs
  ✓ 4 OSTs per OSS
  ✓ Linux heartbeat used to signal failure
  ✓ 8 OSTs on the surviving OSS in failure mode

• RAID 6
  ✓ 10 x 1 TB disks per OST
  ✓ Software RAID (md)
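The figures above can be multiplied out as a quick sanity check. The sketch below counts only the data disks of the RAID 6 sets, ignoring filesystem overhead and any journal or spare devices. The slides do not say how the headline 1 PB figure is counted, so the total is not expected to match it.

```python
# Sketch: OST counts and RAID 6 data capacity from the figures on the slides.
OSS_PAIRS = 9             # from the slides: 9x2 OSS
OST_PER_OSS = 4           # from the slides
DISKS_PER_OST = 10        # from the slides
DISK_TB = 1               # from the slides
RAID6_PARITY_DISKS = 2    # RAID 6 spends two disks' worth of space on parity

oss_count = OSS_PAIRS * 2
ost_count = oss_count * OST_PER_OSS
data_tb_per_ost = (DISKS_PER_OST - RAID6_PARITY_DISKS) * DISK_TB
data_tb_total = ost_count * data_tb_per_ost

def osts_served(partner_failed: bool) -> int:
    """OSTs mounted on one OSS, normally or after its HA partner fails."""
    return OST_PER_OSS * (2 if partner_failed else 1)

print(f"{oss_count} OSS, {ost_count} OSTs")
print(f"{data_tb_per_ost} TB data per OST, {data_tb_total} TB in total")
print(f"OSTs on surviving OSS after failover: {osts_served(True)}")
```

The resulting OST count equals the stripe count used on /scratch, so a fully striped file there spans every OST in the file system.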


HA and heartbeat

• Linux heartbeat/HA
  ✓ Does not always work: we have seen a node unable to “kill” its neighbor (IPMI issues, under investigation)
  ✓ Currently running in manual HA mode

• Clients take a long time to start using the new OSS after it takes over the OSTs
  ✓ Cause not yet known; needs investigating (todo++)


Finding a failed disk

• “Interesting” experience

• Mapping from device name to physical location is not reliable
  ✓ need to double-check with md* commands

• Then came “blinkenlights”
  ✓ http://wikis.sun.com/display/HPCSoftware/JBOD+Troubleshooting+Utilities
  ✓ REALLY useful!
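One way to cross-check which member of a software RAID set has failed is to read `/proc/mdstat`, where failed members carry an `(F)` flag. The sketch below parses a hypothetical excerpt. The device names and block counts are made up for illustration, and the function name is not from the talk. It identifies the failed device name only; mapping that name to a physical slot still needs a tool such as the JBOD utilities mentioned above.

```python
import re

# Hypothetical /proc/mdstat excerpt; device names are made up for illustration.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md12 : active raid6 sdk[9] sdj[8] sdi[7](F) sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
      7814099968 blocks level 6, 64k chunk, algorithm 2 [10/9] [UUUUUUU_UU]
md14 : active raid6 sdu[9] sdt[8] sds[7] sdr[6] sdq[5] sdp[4] sdo[3] sdn[2] sdm[1] sdl[0]
      7814099968 blocks level 6, 64k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
unused devices: <none>
"""

# "mdN : <state> <level> <member list>"
ARRAY_RE = re.compile(r"^(md\d+) : \S+ \S+ (.*)$")
# "sdX[slot]" optionally followed by flags such as "(F)"
MEMBER_RE = re.compile(r"(\w+)\[\d+\]((?:\(\w\))*)")

def failed_members(mdstat_text: str) -> dict[str, list[str]]:
    """Map each md array to the member devices flagged (F) in mdstat."""
    failed: dict[str, list[str]] = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_RE.match(line)
        if not match:
            continue
        array, members = match.groups()
        bad = [dev for dev, flags in MEMBER_RE.findall(members) if "(F)" in flags]
        if bad:
            failed[array] = bad
    return failed

print(failed_members(SAMPLE_MDSTAT))
```

On a real OSS the text would come from reading `/proc/mdstat` itself, and `mdadm --detail /dev/mdN` gives a second view of the same array state.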

[Diagram: OSS1 and OSS2 (two Sun Fire X4250 servers) with attached disks labelled a through cs and their md device groupings, as on the “OSS HA and RAID setup” slide]


Performance

• IOR used for parallel I/O measurements
  ✓ IOR read performance = 33.6 GB/s
  ✓ IOR write performance = 17.3 GB/s
  ✓ all over IB

• Performance monitoring
  ✓ Is there a best current practice (BCP) out there?
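To put the aggregate IOR numbers in per-server terms, the sketch below divides them by the OSS and OST counts given earlier in the deck. It assumes the load was spread evenly over all 18 OSS and 72 OSTs, which the slides do not state, so the results are averages only.

```python
# Sketch: average per-server share of the aggregate IOR numbers.
READ_GBS = 33.6           # from the slides
WRITE_GBS = 17.3          # from the slides
OSS_COUNT = 18            # from the slides: 9x2 OSS
OST_COUNT = 72            # 18 OSS x 4 OSTs per OSS

def per_unit(aggregate_gbs: float, units: int) -> float:
    """Average throughput per unit, assuming an even spread of load."""
    return aggregate_gbs / units

for label, rate in (("read", READ_GBS), ("write", WRITE_GBS)):
    print(f"{label:5s}: {per_unit(rate, OSS_COUNT):.2f} GB/s per OSS, "
          f"{per_unit(rate, OST_COUNT) * 1000:.0f} MB/s per OST")
```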


Conclusion

• Learned quite a lot in the past few months

• Lustre support team is key

• Looking to share experience
