
University of California Larry L. Sautter Award Submission

Innovation in Information Technology at the University of California, San Francisco

Submitted By:

Mr. Michael Williams, MA University of California

Executive Director, Information Technology UC San Francisco Diabetes Center, Immune Tolerance Network

and Chief Information Officer

UC San Francisco Neurology, Epilepsy Phenome/Genome Project

Telephone: (415) 860-3581 Email: [email protected]

Date Submitted: Friday, May 18, 2007

Table of Contents

1. Project Team
   1.1. Team Leaders
   1.2. Team Members
2. Project Summary and Significance
3. Project Description
   3.1. Background Information
   3.2. Situation Prior to ARCAMIS
   3.3. After ARCAMIS Deployment
   3.4. Business Impact
4. Technologies Utilized
   4.1. The ARCAMIS Suite
   4.2. ITIL Team Based Operating Model
   4.3. Security Model and Architecture
   4.4. Data Center Facilities
   4.5. Internet Connectivity
   4.6. Virtual CPU, RAM, Network, and Disk Resources
   4.7. Operating Systems Supported
   4.8. Backup, Archival, and Disaster Recovery
   4.9. Monitoring, Alerting, and Reporting
   4.10. IT Service Management Systems
5. Implementation Timeframe
   5.1. Project Timeline
6. Customer Testimonials
Appendices
   Appendix A – Capabilities Summary of the ARCAMIS Suite
   Appendix B – Excerpt from the ARCAMIS Systems Functional Specification


1. Project Team

1.1. Team Leaders

Michael Williams, M.A.

Executive Director, Information Technology

UC San Francisco Diabetes Center, Immune Tolerance Network

and

Chief Information Officer

UC San Francisco Neurology, Epilepsy Phenome/Genome Project

Gary Kuyat

Senior Systems Architect, Information Technology

UC San Francisco Diabetes Center, Immune Tolerance Network

and

UC San Francisco Neurology, Epilepsy Phenome/Genome Project

1.2. Team Members

Immune Tolerance Network Information Technology:

Jeff Angst

Project Manager

Lijo Neelankavil

Systems Engineer

Diabetes Center Information Technology:

Aaron Gannon

Systems Engineer

Project Sponsors:

Michael Williams, M.A.

Executive Director, Information Technology, Immune Tolerance Network


Jeff Bluestone, Ph.D.

Director, Diabetes Center and Immune Tolerance Network

Daniel Lowenstein, M.D.

Department of Neurology at UCSF; Director of the UCSF Epilepsy Center

Mark A. Musen, M.D., Ph.D.

Professor and Head, Stanford Medical Informatics

Hugh Auchincloss, M.D.

Chief Operating Officer, Immune Tolerance Network (at time of project); currently Principal Deputy Director of NIAID at NIH


2. Project Summary and Significance

By deploying the Advanced Research Computing and Analysis Managed Infrastructure Services (ARCAMIS) suite, the Immune Tolerance Network (ITN) and Epilepsy Phenome/Genome Project (EPGP) at the University of California, San Francisco (UCSF) have implemented multiple Tier 1 networks and physically secured enterprise class datacenters, storage area network (SAN) data consolidation, and server virtualization to achieve a centralized, scalable network and system architecture that is responsive, reliable, and secure. This is combined with a nationally consistent, team centric operating model based on Information Technology Infrastructure Library (ITIL) best practices. Our deployed solution is compliant with applicable confidentiality regulations and assures 24 hour business continuance with no loss of data in the event of a major disaster. ARCAMIS has also provided significant savings on IT costs.

Over the last 3 years we have efficiently met constantly expanding demands for IT resources by virtualizing disk, CPU, RAM, network, and ultimately servers. ARCAMIS has allowed us to provision and support hundreds of production, staging, testing, and development servers at a ratio of 25 guests to one physical host. By using IP remote management technologies that do not require physical presence, combined with server consolidation, virtualization, and SAN based thin-provisioning of storage, we have effectively decoupled infrastructure upgrades from service delivery cycles. Furthermore, centralizing storage on a Storage Area Network (SAN) has given us the ability to provide real-time server backups (no backup window) and hourly disaster recovery snapshots to a Washington, DC, disaster recovery (DR) site for business continuance within hours of a disaster.

ARCAMIS provides the University of California with a proven case study of how to implement enterprise class IT infrastructures and operating models for the benefit of NIH funded clinical research at UCSF. We have accelerated the time from the bedside to the bench in clinical research by taking the IT infrastructure out of the clinical trials’ critical path, thereby providing a positive impact on our core business: preventing and curing human disease. ARCAMIS is more agile and responsive, having reduced server acquisition time to a matter of hours rather than weeks. ARCAMIS is significantly more secure and reliable, providing on the order of 99.998% technically architected uptime, and we have greatly improved the performance and utilization of our IT assets. We have created hundreds of thousands of dollars in measurable cost savings. ARCAMIS is environmentally friendly, significantly reducing our impact on environmental resources such as power and cooling. ARCAMIS can serve as a blueprint for enterprise class clinical research IT infrastructure services throughout the University of California, at partner research institutions and universities, and at the National Institutes of Health.
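As a point of reference, the short sketch below converts an availability target into the downtime it permits per year; the arithmetic is generic, and the targets shown are only examples (99.998% works out to roughly 10.5 minutes of downtime per year).

    # Back-of-the-envelope availability arithmetic, illustrative only.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def allowed_downtime_minutes(availability_pct):
        """Minutes of downtime per year permitted by a given availability target."""
        return (1 - availability_pct / 100.0) * MINUTES_PER_YEAR

    for target in (99.9, 99.998, 99.999):
        print("%.3f%% uptime allows about %.1f minutes of downtime per year"
              % (target, allowed_downtime_minutes(target)))
    # 99.998% works out to roughly 10.5 minutes per year.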

The technologies used include Hewlett Packard Proliant servers and 7000c series blades, VMWare Virtual Infrastructure Enterprise 3.01, a Network Appliance FAS3020 Storage Area Network, Cisco and Brocade networking, Red Hat Enterprise LINUX, and Microsoft Windows Server 2003, among others.


3. Project Description

3.1. Background Information

The mission of the Immune Tolerance Network (ITN) is to prevent and cure

human disease. Based at the University of California, San Francisco (UCSF),

the ITN is a collaborative research project that seeks out, develops and

performs clinical trials and biological assays of immune tolerance. ITN

supported researchers are developing new approaches to induce, maintain,

and monitor tolerance with the goal of designing new immune therapies for

kidney and islet transplantation, autoimmune diseases and allergy and

asthma. Key to our success is the ability to collect, store and analyze, in a secure and effective manner, the huge amount of data collected in ITN’s 30+ global clinical trials at 90+ medical centers, so a reliable, scalable and adaptable IT infrastructure is paramount. The ITN is in the 7th year of 14-year contracts from the NIH, the National Institute of Allergy and Infectious Diseases (NIAID), the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), and the Juvenile Diabetes Research Foundation.

The Epilepsy Phenome/Genome Project (EPGP) studies the complex genetic factors that underlie some of the most common forms of epilepsy, bringing together 50 researchers and clinicians from 15 medical centers throughout the US. The overall strategy of EPGP is to collect detailed, high quality phenotypic information on 3,750 epilepsy patients and 3,000 controls, and to use state-of-the-art genomic and computational methods to identify the contribution of genetic variation to the epilepsy phenotype, developmental anomalies of the brain, and the varied therapeutic response of patients treated with antiepileptic drugs (AEDs). This initial 5 year grant is funded by the NIH, National Institute of Neurological Disorders and Stroke (NINDS).

To address these challenges, the ITN and EPGP turned to computing infrastructure centralization in Tier 1 networked enterprise class datacenters, virtualization, and data consolidation onto a Storage Area Network (SAN) with off-site disaster recovery replication. Combined with an ITIL based, team oriented, nationally consistent operating model leveraging specificity of labor, we are in a position to respond efficiently and scalably to the increasing demands of the organization and rapidly adapt the IT infrastructure to dynamic management goals. This is accomplished while minimizing costs and maintaining requisite quality: we have a true high availability architecture, assuring zero data loss.

3.2. Situation Prior to ARCAMIS

Like most of today’s geographically dispersed IT organizations, we were faced with the challenge of providing IT services in a timely, consistent, and cost effective manner with high customer satisfaction. Unlike many organizations, the ITN and EPGP have many M.D. and Ph.D. clinical research knowledge workers with higher than normal, computationally intensive IT requirements. Escalating site-specific IT infrastructure costs, unpredicted downtime, geographically inconsistent processes and procedures, and the lack of a team based operating model were among the challenges of supporting such a multi-site infrastructure. There was a general sense that IT could do better. The risk of data loss was real. Dynamically growing demands were making it more difficult to consistently provide high IT service quality, and site IT staff were largely reactive and isolated. Prior to the ARCAMIS deployment, the IT infrastructure faced many challenges:

1. High costs of running and managing numerous physical servers in inconsistent, multi-site server rooms with unreliable power, sub-standard cooling, and poorly laid out physical space. Intermittent and unexpected local facility downtime was common. Global website services were served from office servers connected via single T1 lines.

2. Lead time for delivering new services was typically 6 weeks, which directly impacted clinical trials’ costs. Procuring and deploying new infrastructure for new services or upgrades was a major project requiring significant downtime and the direct physical presence of IT staff.

3. Existing computing capacity was underutilized but still required technical support such as backups and patches, with individualized, site based processes and procedures. Limited automation meant significant administrative effort, and managing site specific physical server support, asset tracking, and equipment leases at multiple sites consumed a large amount of IT staff time.

4. The lack of a team based operating model, a consistently automated architecture, and remote management technologies resulted in inconsistent processes and procedures at each site and led to severe variance in service quality and reliability by geography.

5. Limited IT maturity prevented discussion of higher level functions such as auditable policies and procedures, disaster recovery, redundant network architectures, and security audits, all required for NIH clinical trial safety compliance.

3.3. After ARCAMIS Deployment

ARCAMIS represents a paradigm shift in our IT philosophy, both operationally and technically. The goal was to move from a geographically specific, reactive mode to a prospective operating model and technical architecture designed from the ground up to be in alignment with the organization’s growing, dynamic demands for IT services.

Most importantly, we worked with management prospectively to understand service quality expectations and the requirement to scale up to 30 clinical trials in 7 years. Given management objectives and our limited resources, we realized the need for a more team centric operating model providing specificity of labor. As a result, we logically grouped our human resources into a Support team and an Architecture team. This gave more senior technical talent the time they needed to re-engineer, build, and migrate to the ARCAMIS solution while more junior talent continued to focus on day-to-day reactive issues.

From a technical perspective, we engineered an architecture that would eliminate or automate time consuming tasks and improve reliability. By centralizing all ARCAMIS Managed Infrastructure into bi-coastal, carrier diverse, redundant, Tier 1 networked, enterprise class datacenters and using fully “lights-out” Hewlett Packard Proliant servers with 4 hour on-site physical support and a remote, IP based server administration model, we have dramatically improved service reliability and supportability without adding administrative staff. The same senior staff now supports twice the number of physical servers and 20 times the virtual servers. For example, it is now common for engineers to administer infrastructure at all seven sites simultaneously, including handling hard reboots and physical hardware failures.

With the integration of VMWare Virtual Infrastructure 3.01 Enterprise virtualization technology, ARCAMIS reduces the number of

physical servers at our data centers while continuing to meet the

exponentially expanding business server requirements. Less hardware yields

a reduction in initial server hardware costs and saves ongoing data center

lease, power and cooling costs associated with ARCAMIS infrastructure. The

initial capital expenditure was about the same as purchasing physical servers

due to our investment in virtualization and SAN technologies.
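To make the consolidation arithmetic concrete, the sketch below estimates physical host count and power draw at the 25:1 guest-to-host ratio cited earlier; the fleet size and per-server wattages are illustrative assumptions, not measured ARCAMIS figures, chosen to show how a reduction factor of roughly 20 can arise.

    # Illustrative consolidation arithmetic; fleet size and wattages are assumptions.
    import math

    GUESTS_PER_HOST = 25      # ratio cited in the text
    WATTS_STANDALONE = 400    # assumed draw of a lightly loaded standalone server
    WATTS_VIRT_HOST = 500     # assumed draw of a virtualization host

    def consolidation_summary(virtual_servers):
        hosts = math.ceil(virtual_servers / GUESTS_PER_HOST)
        before_kw = virtual_servers * WATTS_STANDALONE / 1000.0
        after_kw = hosts * WATTS_VIRT_HOST / 1000.0
        return hosts, before_kw, after_kw, before_kw / after_kw

    hosts, before, after, factor = consolidation_summary(250)
    print(hosts, before, after, round(factor, 1))
    # 10 hosts; 100.0 kW -> 5.0 kW, a factor of 20 with these assumed wattages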

By consolidating all server data onto the Network Appliance Storage Area Network, the ARCAMIS project deployed a 99.998% uptime, 25 TB,

production and disaster recovery cluster in San Francisco and a 25 TB

99.998% uptime production and disaster recovery cluster site in the

Washington, DC metro area. The SAN allows us to reduce cost and

complexity via automation, resulting in dramatic improvements in operations

efficiency. We can more efficiently use what we already own, oversubscribe

disk, and eliminate silos of underutilized storage. Current storage usage at

the primary site is 65%, up from 25% average per server using Direct

Attached Storage (DAS). We can seamlessly scale to 100 terabytes of

storage by simply adding disk shelves, not possible with a server based

approach. Another key benefit of using SAN technology is risk mitigation via

completely automated backup, archival, and offsite replication. File restores

are instantaneous, eliminating the need for human resource intensive and

less reliable tape backup approaches.
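The utilization gain can be expressed the same way. The sketch below compares the raw capacity needed to hold a given amount of data at 25% average utilization (direct attached storage) versus 65% (consolidated SAN); the data volume used is a hypothetical figure.

    # Illustrative storage-consolidation arithmetic; the data volume is hypothetical.
    def raw_capacity_needed(data_tb, utilization):
        """Raw TB that must be purchased to store data_tb at a given utilization."""
        return data_tb / utilization

    data_tb = 16.0  # hypothetical amount of live data
    das_raw = raw_capacity_needed(data_tb, 0.25)   # 64 TB spread across servers
    san_raw = raw_capacity_needed(data_tb, 0.65)   # ~24.6 TB on the shared array
    print(round(das_raw, 1), round(san_raw, 1), round(das_raw / san_raw, 1))
    # 64.0 TB vs 24.6 TB -- about 2.6x less raw disk for the same data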

Combining the SAN with VMWare Infrastructure 3.01 Enterprise server virtualization technologies provides a reliable, extensible, manageable, high availability architecture. Adjusting to changing server requirements is simple


because of the SAN’s storage expansion and reduction capability for live

volumes and VMWare’s ability to scale from 1 to 4 64-bit CPUs with up to

16GB RAM and 16 network ports per virtual server. Also, oversubscription

allows the ITN to more efficiently use the disk, RAM, and CPU we already

own. We can seamlessly control server, firewall, network and data adds,

removes, and changes without business service interruption. The SAN and

VMWare ESX combination provides excellent performance and reliability using

both Fiber Channel & iSCSI Multipathing for a redundant disk to server access

architecture. For certain applications we can create Highly Available

Clustered Systems truly architected to meet rigorous 99.998% uptime

requirements. VMs boot from the SAN and are replicated locally and off-site

while running. This improves business scalability and agility via accelerated

service deployment and expanded utilization of existing hardware assets.

Physical server maintenance requiring the server to be shut down or rebooted

is done during regular working hours without downtime due to support for

VMotion, the ability to move a running VM from one physical machine to

another. This has greatly reduced off-hours engineer work. The increasing

data security & compliance requirements are also able to be met with the

centralized control provided by the SAN. In our experience, storage

availability determines service availability; automation guarantees service

quality of storage.

[Figure: Bi-coastal SAN and virtualization architecture. Each site (San Francisco, CA and Herndon, VA) runs an active/active, high availability 25 TB Fiber Channel cluster built from NetApp FAS 3050 controllers, DS14 disk shelves, and dual 16 port FC switches, serving two VMWare ESX Servers that each host multiple virtual servers. Passive synchronization replicates data between the two sites.]

The ARCAMIS project has proven and demonstrated the many benefits

promised by these new enterprise class technologies. We have significantly

increased the value of IT to our core business, slashed IT operating costs, and

radically improved the quality of our IT service. The ARCAMIS architecture

and operating model is a core competency which other UC organizations can

leverage to achieve similar benefits.

Just some of the benefits resulting from the ARCAMIS project include the

following:

1. Saved hundreds of thousands of dollars and improved security,

reliability, scalability, and deployment time.


2. Helped the environment by reducing power consumption by a factor of

20 for a comparable service infrastructure.

3. Improved conformance with federal and state regulations such as

HIPAA and 21 CFR Part 11.

4. Centralized critical infrastructures into Tier 1, redundantly multi-homed, enterprise class datacenters. Server space at our 8 sites has been consolidated into three data centers using 5 server racks.

5. Eliminated inconsistent complexity in our IT infrastructure, processes, and procedures, and ensured uptime for our business critical applications, even in the event of hardware failures. All new solution deployments are based on nationally consistent operating models and technical architectures.

6. Consolidated data to SAN and VMWare servers. The infrastructure is

architected to be a true 99.998% uptime solution. Our biggest

downtime risk is human error.

7. Any staff member with security privileges can manage any device at any site from any Internet connected PC, including handling hardware failures and power cycles. We efficiently provision, monitor and manage the infrastructure from a single console.

8. Standardized virtual server builds, procurement and deployment time

reduced from as much as 6 weeks to 2 hours without investing in new

server hardware.

9. Automated backup, archival, and disaster recovery.

10. Cloned production servers for testing and troubleshooting. Systems and networks can be cloned while running, with zero downtime, and rebooted in virtual lab environments. Servers can be rotated back into production with only a few seconds of downtime.

11. Average CPU utilization has risen from 5% to 30% while retaining peak

capacity.

12. Disk utilization has risen from 25% to 65%.

13. Savings over the past 12 months are in the region of $200,000; this will grow as the architecture scales.

14. Multiple operating systems are supported, including RedHat LINUX, MS Windows 2000, and MS Windows 2003, with both 32 and 64 bit versions of all supported operating systems. These can all be deployed on the same physical server, reducing our dependence on vendors’ proprietary solutions.

15. Reduced support overhead and power consumption of legacy

applications by migrating these into the virtual environment.

3.4. Business Impact

ARCAMIS provides the University of California with a proven case study of how to implement enterprise class IT infrastructures and operating models for the benefit of NIH funded clinical research at UCSF. We have accelerated the time from the bedside to the bench in clinical research by taking the IT infrastructure out of the clinical trials’ critical path, thereby providing a positive impact on our core business: preventing and curing human disease. ARCAMIS is more agile and responsive, having reduced server acquisition time to a matter of hours rather than weeks. ARCAMIS is significantly more secure and reliable, providing on the order of 99.998% technically architected uptime, and we have greatly improved the performance and utilization of our IT assets. We have created hundreds of thousands of dollars in measurable cost savings. ARCAMIS is environmentally friendly, significantly reducing our impact on environmental resources such as power and cooling. ARCAMIS can serve as a blueprint for enterprise class clinical research IT infrastructure services throughout the University of California, at partner research institutions and universities, and at the National Institutes of Health.


4. Technologies Utilized

4.1. The ARCAMIS Suite

The suite of Advanced Research Computing and Analysis Managed Infrastructure Services (ARCAMIS) includes the following technology components:

1. ITIL based, nationally consistent, labor specific, team IT operating

model

2. Security model and architecture (including firewalls, intrusion

detection, VPN, automated updates)

3. Enterprise class data center facilities

4. Tier 1, multi-homed, redundant, carrier diverse, networks

5. Virtual CPU, RAM, Network, and Disk resources based on Hewlett

Packard Proliant servers, VMware Infrastructure Enterprise 3.01 and

Network Appliance Storage Area Network (SAN)

6. Various 32 and 64 bit LINUX and Windows Operating Systems

7. Backup, archival and disaster recovery

8. Monitoring, alerting, and reporting

9. IT service management systems


4.2. ITIL Team Based Operating Model

The ARCAMIS operating model is based on ITIL best practices, is nationally consistent and team based, and uses specificity of labor. Via our formal, documented Infrastructure Lifecycle Process (ILCP), support policies and standard operating procedures (SOPs), and support documentation such as operating guides and systems functional specifications, the ARCAMIS infrastructure evolves through its lifecycles of continuous improvement. Below are samples of the IT policies and procedures used.

IT Policies


Standard Operating Procedures

Our goal is to become a fully ITIL based shop within the next 12 months. As the organizational chart below shows, the ARCAMIS team is logically grouped into a prospective engineering team and an administration and support team.

Organizational Chart

[Organizational chart: the Executive Director, Information Technology oversees two groups. The Level 1 and 2 Support Team, led by the Manager of Customer Engineering and an IT Office and Operations Manager, includes Customer Engineers at Laurel Heights/CB, BEA/ITI, Parnassus (two), and Pittsburgh. The Server and Network Engineering Team, providing Level 3 and 4 support, includes a Systems and Network Architect and two Systems and Network Engineers.]

4.3. Security Model and Architecture

ARCAMIS is required to meet at minimum the Security Category and Level of

MODERATE for Confidentiality, Integrity, and Availability as defined by the

National Institutes of Health. Compliance with this Security Category spans

the entire organization from the initial Concept Proposal phase, through

clinical trial design and approval, into trial operations where patient

information is gathered, including data collection and specimen storage.

Significant amounts of confidential, proprietary and unique patient data are

collected, transferred, and stored in the ARCAMIS infrastructure for analysis

and dissemination by approved parties. Certain parts of the infrastructure are

able to satisfy HIPAA and 21 CFR Part 11 compliance. This becomes

especially important as the ITN and EPGP organizations continue to innovate

and develop new intellectual property which may have significant market

value.


Information Security Category Requirements

Exceeding the minimum compliance requirements with this Information

Security Category is achieved by a holistic approach addressing all aspects of

the ARCAMIS personnel, operations, physical locations, networks and

systems. This includes tested, consistently executed, and audited plans,

policies and procedures, and automated, monitored, and logged security

technologies used on a day to day basis. The overall security posture of the

ARCAMIS has many aspects including legal agreements with partners and

employees, personnel background checks and training, organization wide

disaster recovery plans, backup, systems and network security architectures

(firewalls, intrusion detection systems, multiple levels of encryption, etc.),

and detailed documentation requirements.

Consistent with the NIH Application/System Security Plan (SSP) Template for

Applications and General Support Systems and the US Department of Health

and Human Services Official Information Security Program Policy (HHS IRM

Policy 2004-002.001), ARCAMIS maintains a formal information systems

security program to protect the organization’s information resources. This is

called the Information Security and Information Technology Program (ISITP).

ISITP delineates security controls into the four primary categories of

management, operational, technical and standard operating procedures which

structure the organization of the ISITP.

- Management Policies focus on the management of information security systems and the management of risk for a system. They are techniques and concerns addressed by management; examples include Capital Planning and Investment, and Risk Management.

- Operational Policies address security methods focusing on mechanisms primarily implemented and executed by people (as opposed to systems). These controls are put in place to improve the security of a particular system or group of systems; examples include Acceptable Use, Personnel Separation, and Visitor Policies.


- Technical Policies focus on security controls that the computer system executes. These controls can provide automated protection against unauthorized access or misuse, facilitate detection of security violations, and support security requirements for applications and data; examples include password requirements, automatic account lockout, and firewall policies.

- Standard Operating Procedures (SOPs) focus on logistical procedures that staff perform routinely to ensure ongoing compliance; examples include IT Asset Assessment, Server and Network Support, and Systems Administration.

Specifically, the ARCAMIS ISITP includes detailed definitions of the following Operational and Technical Security Policies.

PERSONNEL SECURITY: Background Investigations; Rules of Behavior; Disciplinary Action; Acceptable Use; Separation of Duties; Least Privilege; Security Education and Awareness; Personnel Separation

RESOURCE MANAGEMENT: Provision of Resources; Human Resources; Infrastructure

PHYSICAL SECURITY: Physical Access; Physical Security; Visitor Policy

MEDIA CONTROL: Media Protection; Media Marking; Sanitization and Disposal of Information; Input/Output Controls

COMMUNICATIONS SECURITY: Voice Communications; Data Communications; Video Teleconferencing; Audio Teleconferencing; Webcast; Voice-Over Internet Protocol; Facsimile

WIRELESS COMMUNICATIONS SECURITY: Wireless Local Area Network (LAN); Multifunctional Wireless Devices

EQUIPMENT SECURITY: Workstations; Laptops and Other Portable Computing Devices; Personally Owned Equipment and Software; Hardware Security

ENVIRONMENTAL SECURITY: Fire Prevention; Supporting Utilities

DATA INTEGRITY: Documentation

NETWORK SECURITY POLICIES: Remote Access and Dial-In; Network Security Monitoring; Firewall; System-to-System Interconnection; Internet Security

SYSTEMS SECURITY POLICIES: Identification; Password; Access Control; Automatic Account Lockout; Automatic Session Timeout; Warning Banner; Audit Trails; Peer-to-Peer Communications; Patch Management; Cryptography; Malicious Code Protection; Product Assurance; E-Mail Security; Personal E-Mail Accounts

These policies serve as the foundation of the ARCAMIS Standard Operating Procedures and technical infrastructure architectures which, when combined, create a secure environment based on security best practices.

Security Infrastructure Architecture

To ensure a hardened Information Security and Information Technology environment, ARCAMIS has centralized its critical Information Technology infrastructures into two Tier 1 data centers. The facilities include uninterruptible power supplies backed by diesel generators that can keep servers running indefinitely without direct electric grid power. They are equipped with optimal environmental controls, including sophisticated air conditioning and humidifier equipment, as well as stringent physical security systems, and they provide 24x7 Network Operations Center monitoring and physical security. Each data center also includes water-free fire suppression systems so as not to damage the servers.

For secure data transport, ARCAMIS provides a carrier diverse, redundant,

secure, reliable, Internet connected, high speed Local Area Network (LAN)

and Wide Area Network (WAN). The ARCAMIS network and Virtual Private

Network (VPN) is the foundation for all the ARCAMIS IT services and used by

every ITN and EPGP stakeholder every day. The high speed WAN is protected at all locations by firewalls with intrusion detection, monitoring, and logging. Firewall

and VPN services are provided by industry leading Microsoft and Cisco

products. All network traffic between ITN sites, desktops, and partner

organizations that travels over public networks is encrypted using at least

128-bit encryption using various security protocols including IPSec, SFTP,

RDC, Kerberos, and others. We also implemented a wildcard based virtual

certificate architecture for all port 443 communications, allowing rapid

deployment of new secured services.

To keep these systems monitored and patched, ARCAMIS provides IP ping and SNMP MIB monitoring, specific service monitoring with automated restarts, hardware monitoring, intrusion detection monitoring, and website monitoring


of the ARCAMIS production server environment. Server and end-user

security patches are applied monthly via Software Update Services.

Application and LINUX/Macintosh patches are pushed out on a monthly basis.

We have standardized on McAfee Anti-Virus for virus protection and use

Postini for e-mail SPAM and Virus filtering.

The ITN’s Authoritative Directory uses Microsoft Active Directory and is

exposed via SOAP, RADIUS, and LDAP for cross platform authentication. The

ITN is currently using an Enterprise Certificate Authority (ITNCA) for

certificate based security authentication.

Comprehensive Information Security

The ITN has established mandatory policies, processes, controls, and procedures to ensure confidentiality, integrity, availability, reliability, and non-repudiation within the organization’s infrastructure and its operations. It is the policy of ARCAMIS that the organization abides by or exceeds the requirements outlined in the ITN Information Security and Information Technology Program, thereby exceeding the required Security Category and Level of MODERATE for Confidentiality, Integrity, and Availability outlined above. In addition, ARCAMIS implements additional security policies exceeding the minimum requirements as appropriate for our specific operational and risk environment.

4.4. Data Center Facilities

The ITN has centralized its server architecture into two Tier 1 data centers.

The first is located in Herndon, VA with Cogent Communications, and the

second in San Francisco, CA with Level 3 Communications. An additional

research data center is located at the UCSF QB3 facility. Physical access

requires a badge and biometric hand security scanning, and the facilities have

24x7 security staff on-site. Each data center includes redundant

uninterruptible power supplies and backup diesel generators that can keep

each server running indefinitely without direct electric grid power. The centers

provide active server and application monitoring, helping hands and backup

media rotation capabilities. They are equipped with optimal environment


controls, including sophisticated air conditioning and humidifier equipment as

well as stringent physical security systems. There are also waterless fire

suppression systems. Power to our racks specifically is provided by four

redundant, monitored PDUs which report exact power usage at a point in time

and alert us if there is a power surge.

Herndon, VA Rack Diagram

[Rack diagram: the Herndon, VA rack houses HP ProLiant ML570 G3 and DL320 G3 servers, NetApp FAS 3020 controllers, DS14MK2 FC disk shelves, and tape drives.]

4.5. Internet Connectivity

Servicing the ARCAMIS customer base is a carrier diverse, redundant,

firewalled, reliable, Internet connected high speed network. This network

combined with the Virtual Private Network (VPN) creates the foundation for all

the ARCAMIS services provided.

Internet connectivity is location dependent:

• San Francisco, China Basin (Level 3) – A Tier 1 1000 Mbps Ethernet connection to the Internet is provided by Cogent Networks. UCSF provides a 100 Mbps Ethernet connection to redundant 45 Mbps OC198 connections, and 100 Mbps Ethernet between UCSF campuses.

• San Francisco, Quantitative Biology III Data Center – The UCSF network provides a 1000 Mbps Ethernet connection to redundant 45 Mbps OC198 connections and 100 Mbps Ethernet between UCSF campuses.

• Herndon, VA – A Tier 1 100 Mbps Ethernet connection to the Internet is provided by Cogent Networks. AT&T provides a 1.5 Mbps DSL backup connection.

4.6. Virtual CPU, RAM, Network, and Disk Resources

ARCAMIS uses a Network Appliance Storage Area Network with a 25 TB high availability cluster in Herndon and a 25 TB disaster recovery site in San Francisco. This allows us to reduce cost and complexity via automation and operational efficiency. We can seamlessly control adds, removes, and updates to our critical storage without business interruption. We can more efficiently use what we already own and eliminate silos of underutilized memory, CPU, network, and storage. This improves business scalability and agility via accelerated service deployment and expansion on existing hardware assets. We can scale to tens of terabytes of storage, which is not possible with a server based approach. Another key result of using this technology is risk mitigation: we architecturally eliminate the possibility of critical data loss. Backup, archival, and restore are fully automated, so productivity loss shrinks from days to minutes, or to nothing, in the event of user error or hardware failure, and business continuance in the event of a disaster is technologically automated. The increasing ARCAMIS data security and compliance requirements, including HIPAA, are able to be met with a SAN. In our experience, storage availability determines service availability; automation guarantees service quality.

VMware Virtual Infrastructure Enterprise 3.01 (VI3) is virtual infrastructure software for partitioning, consolidating and managing servers in mission-critical environments. Ideally suited for enterprise data centers, VI3 minimizes the total cost of ownership of computing infrastructure by increasing resource utilization, and its hardware-independent virtual machines, encapsulated in easy-to-manage files, maximize administrative flexibility. VMware ESX Server allows enterprises to boost x86 server utilization to 60-80%, provision new systems faster with less hardware, decouple application workloads from the underlying physical hardware for increased flexibility, and dramatically lower the cost of business continuity. ESX Server supports 64-bit VMs with 16GB of RAM, meeting ARCAMIS’s expanding server computing requirements.

Combining the SAN with server virtualization provides an extremely reliable, extensible, manageable, high availability architecture for ARCAMIS. The SAN provides instantaneous VM backups, restores and provisioning as well as off-site disaster recovery and archival. File restores are instantaneous, eliminating the need for human resource intensive and less reliable client side disk management applications. Adjusting to changing server requirements is instantaneous because of the SAN’s storage expansion and reduction capability for live volumes. Also, oversubscription allows the ITN to use the disk we already own significantly more efficiently. The SAN and VMWare ESX combination provides excellent performance and reliability using both Fiber Channel and iSCSI multi-pathing. VMs boot from the SAN and are replicated locally and off-site while running. For certain applications we can create Highly Available Clustered Systems with even greater than 99.998% uptime. Finally, server maintenance can be done during regular working hours without downtime due to support for VMotion, the ability to move a running VM from one physical machine to another.


4.7. Operating Systems Supported

ARCAMIS supports several operating systems, including all flavors of LINUX,

i386 Solaris, and all versions of the Windows operating system.

4.8. Backup, Archival, and Disaster Recovery

ARCAMIS data availability, backup, and archival are provided by a Storage Area Network (SAN) with a 25 TB high availability cluster in Herndon and a 25 TB disaster recovery site in San Francisco. This SAN houses ARCAMIS critical clinical data and IT server data. The SAN automates backup, archival, and restore via the NetApp SnapMirror, SnapBackup, and SnapRestore applications. All critical data at the San Francisco and Herndon sites is replicated to the other site within 1 hour. In the event of a major disaster at any ARCAMIS datacenter site, at most 60 minutes of data loss can occur, and the critical server infrastructure can be failed over to the other coast’s facility for business continuance. In addition to the SAN, ARCAMIS uses a 7 day incremental backup-to-offline-disk rotation with monthly off-site archives for all production data, based on Symantec Veritas software.
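As a minimal sketch of what this retention policy yields, the following model enumerates restore points and the worst-case data loss window; the snapshot counts come from this section and Appendix A, while the exact scheduling (snapshots landing on the hour and at midnight, 30-day months) is an assumption made for illustration.

    # Restore-point model based on the retention counts stated in this document;
    # exact scheduling (on the hour, at midnight) is an assumption for illustration.
    from datetime import datetime, timedelta

    def restore_points(now, hourly=40, daily=14, monthly=3):
        points = []
        top_of_hour = now.replace(minute=0, second=0, microsecond=0)
        points += [top_of_hour - timedelta(hours=h) for h in range(hourly)]
        midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
        points += [midnight - timedelta(days=d) for d in range(daily)]
        points += [midnight.replace(day=1) - timedelta(days=30 * m) for m in range(monthly)]
        return sorted(set(points), reverse=True)

    now = datetime(2007, 5, 18, 10, 30)
    points = restore_points(now)
    print(len(points), "restore points; worst-case data loss:",
          now - points[0])  # under one hour, matching the stated 60-minute bound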

4.9. Monitoring, Alerting, and Reporting

We use various monitoring and reporting technologies, and two IT staff perform full infrastructure monitoring audits twice each business day, once at 8:00am EST and again at 3:00pm PST. A 1-800 Priority 1 issue resolution line pages and calls 5 senior engineers simultaneously in the event of a major system failure or issue, and we use an on-call rotation schedule that changes weekly. We use the following technologies: Microsoft Operations Manager (MOM), WebWatchBot, Brocade Fabric Manager, NetApp Operations Manager, VMWare Operations Manager, Cacti, and Oracle, among others.

Cacti Disk Utilization Graph

Below is a sample disk utilization graph.


Monitoring Table

Below is a partial list of monitoring we do.

Customer Defined Transaction Monitoring; ODBC Database Query Verification; Ping Monitoring; SMTP Server and Account Monitoring; POP3 Server and Account Monitoring; FTP Upload/Download Verification; File Existence and Content Monitoring; Disk/Share Usage Monitoring; Microsoft Performance Counters; Microsoft Process Monitoring; Microsoft Services Performance Monitoring; Microsoft Services Availability Monitoring; Event Log Monitoring; HTTP/HTTPS URL Monitoring; Customer Specified Port Monitoring; Active Directory; Exchange Intelligent Message Filter; HP ProLiant Servers; Microsoft .NET Framework; Microsoft Baseline Security Analyzer; Microsoft Exchange Server Best Practices Analyzer; Microsoft Exchange Server; Microsoft ISA Server; Microsoft Network Load Balancing; Microsoft Office Live Communications Server 2003; Microsoft Office Live Communications Server 2005; Microsoft Office Project Server; Microsoft Office SharePoint Portal Server 2003; Microsoft Operations Manager MPNotifier; Microsoft Operations Manager

Microsoft Password Change Notification Service; Microsoft SQL Server; Microsoft Web Sites and Services MP; Microsoft Windows Base OS; Microsoft Windows DFS Replication; Microsoft Windows Distributed File Systems; Microsoft Windows DHCP; Microsoft Windows Group Policy; Microsoft Windows Internet Information Services; Microsoft Windows RRAS; Microsoft Windows System Resource Manager; Microsoft Windows Terminal Services; Microsoft Windows Ultrasound; NetApp Volume Utilization; Global Status Indicator; Hardware Event Log; Visual Inspection; Ambient Temperature; Temperature Trending; Location WAN Connectivity
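As a conceptual illustration of the kind of URL and port checks listed above (this is not the actual MOM or WebWatchBot configuration), a basic probe can be written as follows; the target host and URL are placeholders.

    # Conceptual service check, illustrative only; the targets are placeholders.
    import socket
    import urllib.request

    def check_http(url, timeout=10):
        """Return (ok, detail) for an HTTP/HTTPS URL check."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400, "HTTP %d" % resp.status
        except Exception as exc:
            return False, str(exc)

    def check_port(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        print(check_http("https://example.org/"))
        print(check_port("example.org", 443))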

4.10. IT Service Management Systems

We use Remedy and Track-IT Enterprise for Ticketing, Asset Tracking, and

Purchasing.


5. Implementation Timeframe

5.1. Project Timeline


6. Customer Testimonials

“ARCAMIS provides services that allow the ITN knowledge workers to focus on answering the difficult scientific questions in immune tolerance; we don’t waste time on basic IT infrastructure functions. ARCAMIS allows me to be confident our research patient data is stored in a secure, reliable and responsive IT infrastructure. For example, last week we did a demonstration to the Network Executive Committee of our Informatics data management and collaboration portal in real-time. This included the National Institute of Health senior management responsible for our funding… it all worked perfectly. This entire application was built on ARCAMIS.”

Jeffrey A. Bluestone, Ph.D.
Director, UCSF Diabetes Center
Director, Immune Tolerance Network
A.W. and Mary Clausen Distinguished Professor of Medicine, Pathology, Microbiology and Immunology

“With ARCAMIS we are well positioned to meet the rigorous IT requirements of an NIH funded study. Within weeks of project funding from the NIH, our entire secure research computing network and server infrastructure of more than 10 servers was built, our developers finished the public website, and we began work on the Patient Recruitment portal. That would have taken at least 6 months if I had to hire a team to procure and build it ourselves. Accelerating scientific progress in neurology is core to everything we do; ARCAMIS has been an important part of what we are currently doing.”

Dr. Daniel H. Lowenstein, M.D.
Professor of Neurology, UCSF and Director, Physician-Scientist Education and Training Programs
Director, Epilepsy Phenome Genome Project

“With the investment in ARCAMIS, UCSF and the ITN can confidently partner with other leading medical research universities across the country. At the ITN we depend on the on-demand, services based, scalable computing capacity of ARCAMIS every day to enable our collaborative data analysis and Informatics data visualization applications.”

Mark Musen, Ph.D.
Director, Medical Informatics Department, Stanford University
Deputy Director, Immune Tolerance Network


Appendices

Appendix A – Capabilities Summary of the ARCAMIS Suite

Fundamentals

• 99.998% production solution uptime guaranteed via Service Level

Agreement.

• Managed multi-homed, Tier 1 network (Zero Downtime SLA)

• High speed 1000mbs connectivity to UCSF network space.

• Bi-coastal world-class data centers hosted with Level 3 and Cogent Communications, with redundant power and HVAC systems

• Managed DNS or use UCSF DNS

• Managed Active Directory for “Production Servers” and integration with

UCSF CAMPUS AD via trust.

• Phone, e-mail and web based ticketing system to track all issues

• Mature purchasing services with purchases charged to correct account

Monitoring & Issue Response

• 8am EST to 5pm PST business day access to live support personnel

• 24/7/365 coverage with one primary “on call” engineer reachable by pager off hours, plus a 1-800 P1 issue number that rings 5 infrastructure engineers simultaneously.

• Microsoft Operations Manager monitoring (CPU, RAM, disk, event log,

ping, ports and services)

• Application script response monitoring for web applications, including

SSL via WebWatchBot 5

• HP Remote Insight Manager hardware monitoring with 4 hour vendor

response on all servers

• NetApp corporate monitoring and 4 hour time to resolution, with a fully stocked parts depot, on the Storage Area Network.

• 24x7 staffed datacenters with secure physical access to all servers

• 24x7 staffed Network Operations Center for the WAN

• Notification preferences and standard response specifications can be

customized

Backup, Restore and Disaster Recovery/Business Continuance

• Symantec Backup Exec server agents for Oracle, SQL, MySQL, and

Exchange servers with 7 nightly incremental backups.

• 14 local daily snapshots of full “crash consistent” server state

• Hourly off-site snapshots of full “crash consistent” server state with 40

hourly restore points for DR

• Monthly archive of the entire infrastructure, which rolls to quarterly after 3 months.

Reporting

• Online Ticketing

• Detailed Backup Utilization

• Bandwidth Utilization

• Infrastructure uptime reports

• CPU, RAM, Network, and Disk utilization reports

Server & Device Administration

• Customized specifications using VMWare Infrastructure 3.01 technology: up to four 64-bit, 3.0 GHz Intel Xeon processors, 16 GB RAM, 1 Gbps network connectivity, and disk volumes of up to 2 TB

• Based on HP Proliant enterprise servers: ML570 (8 processors per server), DL380 series, and 7000c series blade servers

• IP everywhere: full remote management of every device, including full KVM via a separate backLAN network

• Microsoft MCCA licensing on key server components

• Full license and asset tracking

• Senior System Administrator troubleshooting

• Optional high availability (99.999% uptime) server capabilities via

Veritas and Microsoft Clustering

Managed Security


• Automated OS and major application patching

• Managed Network-based Intrusion Detection

• Managed policy based enterprise firewall using Cisco and Microsoft

technologies

• Managed VPN access


Appendix B – Excerpt from the ARCAMIS Systems Functional Specification

Centralized Virtual Infrastructure Administration

ARCAMIS can move virtual machines between hosts, create new machines from pre-built templates, and control existing virtual machine configurations. We can also gather event log information for all VMware hosts from a central location; identify asset utilization and troubleshoot warnings before problems occur; more easily manage physical system BIOS and firmware upgrades; and centrally manage all virtual machines within the network. The Virtual Center management interface allows us to centrally manage and monitor our entire physical and virtual infrastructure from one place.

Hosts, Clusters and Resource Pools: By organizing physical hosts into clusters of two or more, we are able to distribute their aggregate resources as if they were one physical host. For example, a single server might be configured with 4 dual core 2.7 GHz processors and 24 GB of RAM. By clustering two such servers together, the resources are presented as roughly 43 GHz of CPU and 48 GB of RAM, which can be provisioned as needed to multiple guests.
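A minimal sketch of that aggregation, using the per-host configuration given above (4 dual core 2.7 GHz processors and 24 GB of RAM), is shown below.

    # Aggregate cluster capacity from the per-host figures given in the text.
    def cluster_capacity(hosts, sockets=4, cores_per_socket=2, ghz_per_core=2.7, ram_gb=24):
        total_ghz = hosts * sockets * cores_per_socket * ghz_per_core
        total_ram = hosts * ram_gb
        return total_ghz, total_ram

    ghz, ram = cluster_capacity(hosts=2)
    print(round(ghz, 1), "GHz aggregate CPU,", ram, "GB aggregate RAM")
    # 43.2 GHz and 48 GB for a two-host cluster with this configuration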


DRS and VMotion: VMotion enables us to migrate live servers from one physical host to another, which allows physical host maintenance to be performed with no impact to production service uptime. The Distributed Resource Scheduler (DRS) is used to set different resource allocation policies for different classes of services, which are automatically monitored and enforced using the aggregate resources of the cluster.
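The sketch below is a conceptual, simplified placement heuristic of the kind DRS automates; it is not VMware’s actual algorithm. It selects the host with the lowest projected CPU utilization that still has room for the virtual machine’s CPU and memory reservation; the host figures are hypothetical.

    # Conceptual DRS-style placement heuristic; not VMware's actual algorithm.
    def place_vm(hosts, vm_ghz, vm_ram_gb):
        """hosts: list of dicts with total/used CPU (GHz) and RAM (GB)."""
        candidates = [
            h for h in hosts
            if h["cpu_total"] - h["cpu_used"] >= vm_ghz
            and h["ram_total"] - h["ram_used"] >= vm_ram_gb
        ]
        if not candidates:
            return None  # would trigger a rebalance via VMotion or an alert
        # Prefer the host with the lowest CPU utilization after placement.
        return min(candidates,
                   key=lambda h: (h["cpu_used"] + vm_ghz) / h["cpu_total"])

    hosts = [
        {"name": "esx1", "cpu_total": 21.6, "cpu_used": 15.0, "ram_total": 24, "ram_used": 18},
        {"name": "esx2", "cpu_total": 21.6, "cpu_used": 6.0, "ram_total": 24, "ram_used": 8},
    ]
    print(place_vm(hosts, vm_ghz=2.0, vm_ram_gb=4)["name"])  # esx2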