DRAFT
Institutional Research Computing at WSU: A community-based approach
Governance model, access policy, and acquisition strategy
for consideration by the ITSAC Research Computing Sub-committee
August 5, 2015
WASHINGTON STATE UNIVERSITY
Institutional research computing platforms managed by central IT (WSU/Pullman): http://officeofresearch.wsu.edu/researchcomputing/
Current HPC platform: IBM iDataPlex (2011)
— Compute nodes: 164 CPU nodes (12 cores, 24 GB each)
— Large-memory nodes: 3 CPU nodes (32 cores, 512 GB each)
— Storage: no physical local disk at compute nodes
— Network: InfiniBand switch (1); 40 Gb switch (1)

WSU Kamiak (pilot) cluster (2015+)
— Compute nodes: 32 CPU nodes (20 cores, 256 GB / 512 GB each); 2 NVIDIA GPU nodes; 1 Intel Xeon Phi accelerator node
— Large-memory nodes: 2 TB RAM server (60 cores)
— Storage: NetApp file storage (633 TB)
— Network: InfiniBand switch (1); 40 Gb switch (1); 10 Gb switches for network storage (4); 10 GbE switches for network storage (3)
[Figure: pilot Kamiak cluster; 1 rack = 16 sq. ft.]
The WSU pilot condominium Kamiak cluster: a $1.3M / 3-rack system that balances compute- and data-intensive research needs
• Funding: $1.3M = $0.8M (CAHNRS) + $0.5M (VPR)
— Hardware: $1.18M
— Operation: $0.12M (total for 5 years)
• Location:
— WSU/Pullman: ITB, Room 1010
• Operating funds: 2 FTEs
— Systems administrator for HPC
— IT consultant for research computing (User Support Group)
• Schedule:
— Procurement: January 2015
— Delivery: April 2015
— Installation and testing: January – October 2015
— Open to early adopters: October 2015
• WSU institutional research computing: http://officeofresearch.wsu.edu/researchcomputing/

[Figure: rack layout; labels: Compute / Management / Storage]
The WSU full-size condominium Kamiak cluster (phase 1): a 9-rack system funded by equipment and research grants, start-up funds, and other contributions from faculty, research staff, and academic units
[Figure: phase 1 rack layout, including the WSU pilot Kamiak cluster; labels: Compute / Management / Storage]
Principles of condominium research computing
• What?
— Condominium computing provides users with shares of institutional cluster-based computing resources.
• How?
— Institutional research computing resources are managed and administered by WSU central IT and co-located in the IT building (ITB)
— Investors purchase nodes that become part of the institutional condominium cluster
— Investors retain ownership of the hardware they purchase
— Investors have “on-demand” access to the resources they own
— Unused resources are dynamically harvested and made available to general users
— A multi-tiered queuing system implements different access levels: investors, general users, etc.
• Who?
— Available to the entire WSU community as an institutional resource
• Why (benefits)?
— Provides access to larger-scale, leveraged cyber-infrastructure for enhanced productivity (“speed of science”)
— Provides mid-scale HPC resources for production runs and a development testbed
— Prepares researchers for extreme-scale computing at national facilities
— Enables coordinated software installation, implementation, development, and optimization
— Integrates systems administration roles and responsibilities at the institutional level
— Provides a higher level of user support for application domain scientists
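The harvest-and-preempt cycle described above can be illustrated with a toy scheduler. This is a sketch only: the class and method names (`Node`, `CondoScheduler`, `submit_backfill`, `submit_owner`) are hypothetical, and a production scheduler implements these policies with far more nuance (fair-share accounting, requeueing of preempted work, multi-node jobs).

```python
# Toy model of condominium scheduling: idle investor-owned nodes are
# "harvested" for general backfill jobs, and an owner reclaims capacity
# on demand by preempting backfill work on their own nodes.
# All names are illustrative, not Kamiak's actual scheduler API.

class Node:
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner      # investor who purchased this node
        self.job = None         # (user, kind) currently running, or None

class CondoScheduler:
    def __init__(self, nodes):
        self.nodes = nodes

    def submit_backfill(self, user):
        """General users run on any idle node, whoever owns it."""
        for n in self.nodes:
            if n.job is None:
                n.job = (user, "backfill")
                return n.name
        return None             # no idle capacity right now

    def submit_owner(self, owner):
        """Owners get on-demand access to their own nodes, preempting
        harvested backfill jobs if necessary."""
        mine = [n for n in self.nodes if n.owner == owner]
        for n in mine:          # prefer an idle owned node...
            if n.job is None:
                n.job = (owner, "batch")
                return n.name
        for n in mine:          # ...else preempt a backfill job on an owned node
            if n.job[1] == "backfill":
                n.job = (owner, "batch")  # preempted job would be requeued
                return n.name
        return None             # all owned nodes busy with owner jobs

nodes = [Node("cn01", "labA"), Node("cn02", "labA"), Node("cn03", "labB")]
sched = CondoScheduler(nodes)
sched.submit_backfill("grad1")            # harvested onto idle cn01
sched.submit_backfill("grad2")            # harvested onto idle cn02
print(sched.submit_owner("labA"))         # labA reclaims cn01 by preemption
```

The key design point, mirrored in the policy above, is that backfill jobs never block owners: ownership is enforced at reclaim time via preemption, not by fencing nodes off while they sit idle.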
Sponsors, investors, and general users

Sponsors
— Who: Colleges; academic units; Office of Research
— What: IT and research computing staffing (systems administrator; user support for research computing)
— Comments: possible contributions to cyber-infrastructure

Investors (Owners)
— Who: faculty and researchers who require predictable computational availability
— What: purchase “menu” equipment (compute nodes, storage, etc.)
— How: WSU/ITS purchases the nodes, deploys them in the shared infrastructure, and operates them for a fixed number of years; once installed, purchased nodes become part of the Kamiak cluster; the cost of a node is the price of the equipment plus a markup for IT systems administration and user support

General users
— Who: the entire WSU community, sponsored by their administrative college
— How: unused compute cycles in the condominium are available to general users; access to the “backfill” queue by general users can be preempted at any time by investors’ priority access

Institution
— Who: Office of Provost; Office of Research; Office of Finance; WSU/ITS; Colleges
— What: physical infrastructure (equipment room space; power, cooling, etc.; racks)
— Comments: possible contributions to cyber-infrastructure
Proposed access policy

Sponsors
— Who: Colleges; academic units; Office of Research

Investors (Owners)
— Who: faculty and researchers who require predictable computational availability and who contribute hardware (compute nodes, storage, etc.) to the institutional shared infrastructure
— How: investors have “on-demand” access to their own nodes through a dedicated “batch” queue; all jobs submitted by investors to Kamiak in the “batch” queue run on the investor’s own nodes; investors also have access to the general “backfill” queue at increased priority in proportion to their investment

General users
— Who: the entire WSU community
— How: general users can submit jobs to the “backfill” queue, where they execute on idle CPUs anywhere in the cluster; Kamiak implements several policies to ensure fair access; backfill jobs may be preempted at any time if investors need access to their resources

Institution
— Who: Office of Provost; Office of Research; Office of Finance; WSU/ITS; Colleges
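One simple reading of “increased priority in proportion to their investment” is a multiplicative weight applied to a job’s baseline priority in the shared backfill queue. The formula and names below are illustrative assumptions only; the document does not specify Kamiak’s actual priority calculation.

```python
# Hypothetical backfill-priority weighting: an investor's share of the
# total purchased hardware scales up their queue priority, while general
# (non-investor) users keep the baseline. Illustrative formula only.

def backfill_priority(base_priority, investment, total_investment, boost=1.0):
    """Scale a job's baseline priority by the submitter's investment share.

    investment / total_investment is 0 for general users, so they keep
    base_priority; investors get a proportional boost on top of it.
    """
    share = investment / total_investment if total_investment else 0.0
    return base_priority * (1.0 + boost * share)

# A lab that funded 20% of the cluster vs. a general user:
print(backfill_priority(1.0, investment=200_000, total_investment=1_000_000))
print(backfill_priority(1.0, investment=0, total_investment=1_000_000))
```

In a real deployment this kind of weight would feed into the scheduler’s multifactor priority calculation alongside fair-share, job age, and size factors, rather than being the sole ranking criterion.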
Establishing a Governance Board for condominium computing
• Purpose:
— Ensures that research computing assets (systems, cyber-infrastructure, processes, access policy, etc.) are implemented and used according to agreed-upon policies and procedures
— Ensures that research computing assets are properly controlled and maintained
— Ensures that research computing assets provide value to WSU and the university’s research community
— Reviews applications for use of Kamiak resources
— Arbitrates special requests for utilization of Kamiak resources
• Chairmanship and membership of the Kamiak Governance Board (proposal):
— Chair: Vice President for IT Services and CIO
— Membership: investors (faculty, staff, etc.) and IT personnel
WSU is committing resources to establish a user support group for application software implementation, optimization, and development
• Establishment of a software user support group: “IT Research Computing Consultant”
— Focus on research computing
— Provide assistance in software installation, development, and optimization
— Cover a broad spectrum of application domains: materials science and engineering; chemistry and chemical engineering; bioinformatics; genomics; atmospheric research; parallel scientific computing
— Installation and management of software libraries
— Development of documentation and training material for the effective use of institutional HPC resources
• Support:
— Institutional support from Colleges (CAS, CAHNRS, and VCEA) and the Office of the VPR (1 FTE)
Hardware acquisition strategy
• Principles of WSU’s proposed business model:
— Cost to users is less than purchasing and operating stand-alone equipment
— Cost to WSU is less than users acting independently
• Cost model:
— Based on a 5-year lifecycle
— Price ranges are driven by how systems administration and infrastructure costs are covered
— Includes full 5-year hardware maintenance
— Includes costs for most IT-related infrastructure: 10GbE local network, 10GbE connection to HSSRC, FDR InfiniBand, management nodes, etc.
— Flexible hardware configurations for memory and CPU
Equipment, specifications, and 5-year price range ($K):
— Standard Compute Node: 2x Intel E5-2680v2 (20 total cores / 40 total threads), 256 GB RAM, 400 GB SSD, 10GbE, FDR InfiniBand: 7.5 – 15
— Large Compute Node: 2x Intel E5-2680v2 (20 total cores / 40 total threads), 512 GB RAM, 400 GB SSD, 10GbE, FDR InfiniBand: 17 – 24
— Large Memory Node: 4x Intel E7-4880v2 (60 total cores / 120 total threads), 2 TB RAM, 10Gb, FDR InfiniBand: 57 – 66
— GPU Compute Node: 2x Intel E5-2670v3 (24 total cores / 48 total threads), 2 Tesla K80 GPUs (9,984 CUDA cores), 256 GB RAM, 400 GB SSD, 10GbE, FDR InfiniBand: 16 – 22
— 150 TB storage: SSD, 10K, and 7.2K tiers: 50
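The pricing model above implies a simple per-node calculation: equipment price plus a markup covering IT systems administration and user support, amortized over the 5-year lifecycle. The sketch below uses a made-up 20% markup purely for illustration; the document does not state actual markup rates.

```python
# Illustrative 5-year cost calculation for an investor node, following the
# stated model: node cost = equipment price + markup for IT systems
# administration and user support, over a 5-year lifecycle.
# The 20% default markup is a placeholder, NOT WSU's actual rate.

def node_cost(equipment_price, markup_rate=0.20, lifecycle_years=5):
    """Return (total cost, annualized cost) for one purchased node."""
    total = equipment_price * (1.0 + markup_rate)
    return total, total / lifecycle_years

# Standard compute node at the low end of the listed range ($7.5K):
total, per_year = node_cost(7_500)
print(f"standard node, low end: ${total:,.0f} total, ${per_year:,.0f}/year")
```

Under the stated business-model principles, the all-in figure produced this way should still undercut what a lab would pay to buy, house, power, and administer an equivalent stand-alone machine for five years.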
Condominium computing: Adopting best practices from successful implementations
• Several successful case studies in academia:
— Clemson University: http://www.clemson.edu/ccit/about/departments/citi/
— Purdue: https://www.rcac.purdue.edu
— UC Berkeley: http://research-it.berkeley.edu/services/high-performance-computing/institutional-and-condo-computing
— UW: http://escience.washington.edu/content/hyak-0
• Features common to all successful condominium computing implementations:
— Provides benefits to the research community
— Enables computing “at scale” by allowing a surge in computing capability
— Reduces the time and money spent on maintaining computational resources
— Researchers have confidence in the center’s ability to meet all their needs
— The center establishes a proven track record of providing resources, so that researchers can focus on their research without worrying about maintaining their hardware
— Sustained institutional support for embedding infrastructure: initial condo system, upgrades, re-capitalization, network, support personnel, “cyber-institute”, etc.
Clemson Palmetto Condo HPC
• Clemson Computing and Information Technology:
— Provides cyber-infrastructure resources and HPC capabilities
— Provides advanced knowledge infrastructure through integration of HPC, networks, data visualization, and storage architecture
• Palmetto condo HPC:
— 85 Tflops peak performance
— 1,541 nodes / 8 cores per node
— 120 TB high-performance storage
— Condo owners (“tenants”) buy “preemption units”
— Preemption units allow an owner’s job to preempt general jobs when needed to acquire the resources it requires, and protect the owner’s job from itself being preempted
— Unused owner resources are dynamically harvested for general users
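Clemson’s preemption-unit idea can be illustrated with a toy accounting model: an owner job spends units to displace general jobs when idle capacity alone cannot satisfy it. All names and rules below are simplifications for illustration; Palmetto’s actual policy (including how units shield owner jobs from preemption) is more involved.

```python
# Toy model of Clemson-style "preemption units": an owner job spends one
# unit per general job it preempts to acquire the nodes it needs.
# Illustrative only; not Palmetto's real accounting rules.

class OwnerAccount:
    def __init__(self, preemption_units):
        self.units = preemption_units

    def acquire_nodes(self, nodes_needed, idle_nodes):
        """Use idle nodes first; spend one unit per general job preempted.

        Returns True if the owner job can start now, False if it must
        wait because the account lacks sufficient preemption units.
        """
        preemptions_needed = max(0, nodes_needed - idle_nodes)
        if preemptions_needed > self.units:
            return False
        self.units -= preemptions_needed
        return True

acct = OwnerAccount(preemption_units=3)
print(acct.acquire_nodes(nodes_needed=4, idle_nodes=2))  # True: spends 2 units
print(acct.units)                                         # 1 unit left
print(acct.acquire_nodes(nodes_needed=5, idle_nodes=1))  # False: would need 4
```

The design mirrors the condominium bargain: general users absorb idle capacity for free, while owners pre-pay (in units) for the right to take it back on demand.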
Back-up