Guillimin HPC Users Meeting
June 19, 2014

Bart Oldeman (bart.oldeman@mcgill.ca)
McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada



Outline

• Compute Canada News
• Upcoming Maintenance Downtime in August
• Storage System News
• Scheduler Updates and Demonstration
• Software and User Environment Updates
• Training News
• New Visualization and Collaboration Environment

Compute Canada News

• Compute Canada SPARC (Sustainable Planning for Advanced Research Computing)
  – Consultation process with the research community to build a national plan for advanced computing, data storage and archiving requirements
  – Targeted for CFI's planned renewal of Compute Canada infrastructure as well as funding for domain-specific data projects
  – Consultations (white papers, workshops) in summer to prepare a preliminary plan for November 2014
  – Renewal plan due April 2015
  – Notices of Intent for domain proposals due Jan. 2015
  – More info: www.computecanada.ca

Maintenance Downtime

• Guillimin Maintenance Downtime: August 4 - 7
  – Maintenance outage to the data centre cooling distribution system
  – Will require stoppage of all logins, data access and batch job activities
  – Further information regarding the planned maintenance downtime will be distributed by the middle of July.

Storage System News

• GPFS Stability Issues - Update
  – Regular occurrence of GPFS instability on nodes due to node expels
    • Typical impact: interruption or halt of writing from jobs
  – Investigation with GPFS and IB support team ongoing with critical priority
  – Latest Actions (June 11): 2nd update to all node IB network tunings
    • Additional increase in receive queue size for IP-over-IB communications across the much larger scale IB fabric
    • Continue to observe significant decrease in number of node expels (~1-2 every few days - major stability improvement)
    • Additional investigations to further improve performance

Storage System News

• Reminder: Upcoming Activities
  – Online expansion of /gs to full target size (~2.9 PB)
  – Tape Archive (Backup) and Hierarchical Storage Management (HSM) Integration
    • Migration of scratch policy to use HSM rules for identification and cleanup (In Progress)
    • Analyzing characteristics of file system contents to identify suitable HSM migration policies (In Progress)
    • Access to tape for targeted backups (In Progress)

Scheduler Update

• In general improved overall stability and performance
  – A few outstanding issues under review with Adaptive Computing
  – Testing in development environment with update to Torque 4.2.8 in progress
• Recall: April 10 - qsub for job submission enabled
  – Default PATH settings updated to include Torque commands (qsub, qstat, …)
  – Much faster response for submissions, queries compared to Moab commands (msub, canceljob, …)
  – qsub submission filter: qsub -A <RAPid> required for proper accounting and priority assignment (will be relaxed later)
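The `-A <RAPid>` requirement can be met on the `qsub` command line or with a `#PBS -A` line inside the job script. A minimal sketch, assuming a placeholder RAP ID `xyz-123-aa` and a hypothetical file name `myjob.pbs` (neither comes from the slides); substitute your own values:

```shell
#!/bin/sh
# Write a small Torque job script. RAPID and JOBFILE are placeholders.
RAPID="xyz-123-aa"        # hypothetical RAP ID, replace with your own
JOBFILE="myjob.pbs"       # hypothetical file name

cat > "$JOBFILE" <<EOF
#!/bin/bash
#PBS -A $RAPID
#PBS -l nodes=1:ppn=1,walltime=01:00:00
cd "\$PBS_O_WORKDIR"
echo "Running on \$(hostname)"
EOF

# Submit only where the Torque client is installed; otherwise show the command.
if command -v qsub >/dev/null 2>&1; then
    qsub "$JOBFILE"       # alternative: qsub -A "$RAPID" "$JOBFILE"
else
    echo "qsub not found; would run: qsub $JOBFILE"
fi
```

Keeping the RAP ID in the script means the same file can be resubmitted without retyping the option.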

Scheduler Update

• Job submission documentation updated
  – www.hpc.mcgill.ca → Documentation → Submitting Your Job
• With the migration to CentOS 6, nodes are set to the new scheduler
• In the default queue, the chosen node depends on the pmem (memory) PBS parameter or node feature (i.e. m256G, m512G, …)
• Internal routing for “short” jobs in default queue

Scheduler Update

• Default Queue - Serial Jobs (nodes=1:ppn=n, n<12) (new: SW2)
• Default Queue - Parallel Jobs (new: higher walltime boundary)
  • (*) default if procs > 12 or nodes > 1 (which need to communicate over IB)
  • (**) default if procs = 12 or nodes=1:ppn=12
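The request shapes named on this slide can be expressed as a small check. This is a sketch only: the labels `serial`, `whole-node` and `multi-node` are illustrative names, not queue names from the slides, and 12 cores per node is inferred from the `ppn=12` figures.

```shell
#!/bin/sh
# classify_request NODES PPN
# Print an illustrative label for a nodes=NODES:ppn=PPN request,
# following the slide's footnotes (12 cores per node assumed).
classify_request() {
    nodes=$1
    ppn=$2
    if [ "$nodes" -gt 1 ] || [ "$ppn" -gt 12 ]; then
        echo "multi-node"     # (*)  procs > 12 or nodes > 1, communicates over IB
    elif [ "$ppn" -eq 12 ]; then
        echo "whole-node"     # (**) procs = 12 or nodes=1:ppn=12
    else
        echo "serial"         # nodes=1:ppn=n with n < 12
    fi
}

classify_request 1 4
classify_request 1 12
classify_request 2 12
```

The three calls print `serial`, `whole-node` and `multi-node` respectively.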

Scheduler Update

• Extra large memory nodes (XLM2)
  ● Alternative to ScaleMP (offline, to be reimaged to CentOS-6 next week)
  ● Some nodes reserved by CFI grant holders, others get 12 hours only
  ● Example PBS submission lines:
  ● #PBS -l nodes=1:ppn=16,pmem=11700m,walltime=10:00:00 (any XLM2 node)
  ● #PBS -l nodes=1:ppn=16,pmem=11700m,walltime=1:00:00:00 (non-CFI nodes only)
  ● #PBS -l nodes=1:ppn=1,pmem=31700m,walltime=10:00:00 (serial on m512G/m1024G)
  ● #PBS -l nodes=1:ppn=16,pmem=31700m,walltime=10:00:00 (16 cores on m512G/m1024G)
  ● #PBS -l nodes=1:ppn=16,pmem=31700m,walltime=1:00:00:00 (16 cores on m512G: non-CFI)
  ● #PBS -l nodes=4:ppn=16,pmem=31700m,walltime=1:00:00:00 (all cores on m512G nodes)
  ● #PBS -l nodes=1:ppn=32:m1024G,pmem=31700m (specific node type, IF you are the CFI holder)
  ● #PBS -l nodes=1:ppn=16:m256G,pmem=15700m (specific node type)
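To see why these pmem values pair with particular node types, it helps to multiply them out. The sketch below assumes that `pmem` is a per-process limit, so a job asks for roughly `ppn × pmem` on each node, that the `m` suffix means megabytes, and that a four-field walltime reads as `DD:HH:MM:SS`. The function names are illustrative.

```shell
#!/bin/sh
# node_mem_mb PPN PMEM_MB: memory requested per node, in MB.
node_mem_mb() {
    echo $(( $1 * $2 ))
}

# walltime_seconds [DD:]HH:MM:SS: convert a walltime string to seconds.
walltime_seconds() {
    echo "$1" | awk -F: '{
        n = NF
        s = $n + 60 * $(n - 1) + 3600 * $(n - 2)
        if (n == 4) s += 86400 * $1
        print s
    }'
}

node_mem_mb 16 11700          # 187200
node_mem_mb 16 15700          # 251200
node_mem_mb 16 31700          # 507200
node_mem_mb 32 31700          # 1014400
walltime_seconds 10:00:00     # 36000
walltime_seconds 1:00:00:00   # 86400
```

Under those assumptions the last three memory totals sit just below the sizes in the m256G, m512G and m1024G feature names, and `1:00:00:00` is one day.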

Scheduler Update

• Examples: why is my job not running yet?
  • checkjob -v JOB_ID (can also use -v -v, etc.)
  • showq
  • showq -i -v
  • showq -r -v
  • showq -w class=<queue_name>
  • showq -w class=hb
  • showq -w class=hbplus
  • showq -w class=hb -r
  • showq -w class=hbplus -r
  • showq -w class=hb -i
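These queries can be bundled into one helper. A minimal sketch, assuming a hypothetical function name `why_waiting` and a placeholder job ID `12345`. It only prints the Moab commands from the slide, so it can be tried anywhere; on a login node the output can be piped to `sh`.

```shell
#!/bin/sh
# why_waiting JOB_ID [QUEUE]
# Print the diagnostic commands for one job, optionally narrowed to one queue.
why_waiting() {
    job=$1
    queue=${2:-}
    echo "checkjob -v $job"                   # detailed state of this job
    echo "showq -i -v"                        # idle jobs
    echo "showq -r -v"                        # running jobs
    if [ -n "$queue" ]; then
        echo "showq -w class=$queue -i"       # idle jobs in that queue
        echo "showq -w class=$queue -r"       # running jobs in that queue
    fi
}

why_waiting 12345 hb
```

Printing rather than executing keeps the helper safe to run before checking which commands are on the PATH.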

Software Update

• New Installations
  – petsc/3.4.4-openmpi-1.6.3-{gcc,intel}
  – h5py for python/2.7.3
• GPU Updates
  – Driver update completed: NVIDIA-Linux-x86_64-331.67
  – Update to /etc/bashrc on GPU nodes to allow for correct operation of the NVIDIA Profiler
• MDCS and Matlab Update
  – April 22 - license manager migrated to CentOS 6
  – Now supports up to 2014a
  – Includes update to standard Matlab license for McGill users (access restricted due to Mathworks license requirements)
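As a sketch of how the new installations might be used: the `{gcc,intel}` suffix stands for two separate module names, one per compiler. The example assumes the `module` command is available in the shell and that loading `python/2.7.3` makes `h5py` importable; both are assumptions to verify on the system itself.

```shell
#!/bin/sh
# Print the two PETSc module names listed on the slide.
petsc_modules() {
    for compiler in gcc intel; do
        echo "petsc/3.4.4-openmpi-1.6.3-$compiler"
    done
}

petsc_modules

# Check h5py only where the "module" command exists (assumed setup).
if command -v module >/dev/null 2>&1; then
    module load python/2.7.3
    python -c 'import h5py; print(h5py.version.version)'
fi
```

Either printed name can be passed to `module load`; a PETSc build may also need its matching compiler and Open MPI modules loaded first.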

Software Update

• Compiler Updates / Additions to come
  – Intel 14.0.2
  – License manager migration required to support newer Intel installations
  – Long-term: project to standardize modules across Calcul Québec
• Others in progress
  – MIO2/1.0 – modular I/O library from IBM Research
  – IOBUFF from Calcul Québec

Training News

• See ‘Training’ at www.hpc.mcgill.ca for our full calendar of training and workshops planned for 2014 and to register
• Upcoming:
  – July 10 - MapReduce and Hadoop for Big Data
  – August 17 - Scientific Visualization Tools
• Recently Completed:
  – June 5 - Advanced OpenMP
  – May 22 - Introduction to the Xeon Phi

New Visualization & Collaboration Environment

• Located at the McGill HPC Centre at ETS (Peel and Notre-Dame O.)
  – Polycom Group 700 HD series multi-point conferencing unit
  – Two 55” LED LCD and one 65” LED LCD screens
  – Crestron AirMedia for wireless connectivity
  – Room capacity of 10 - 15 people
  – Room is available for video-conferencing or data visualization
  – Contact us at [email protected] to access this resource

Other Developments

• Work has started on the summer upgrade of the network link from the data centre at ETS to the McGill core network
  – Upgrade from 10 to 40 Gbps
  – Will include 10 Gbps connection to the Calcul Québec router
• Upgrade will enable support for projects requiring additional dedicated network bandwidth in/out from the data centre
• Testing to be completed in July and in production by end of August

User Feedback and Discussion

• Questions? Comments?
• We value your feedback.
• Guillimin Operational News for Users
  – Follow us on Twitter: http://twitter.com/McGillHPC