National Archives and Records Administration
National Archives Catalog (The Catalog)
NARA Catalog System Design – Catalog Perspective –
Status - Final
Version 1.6
July 8, 2015
Contract Number GS-35F-0541U
Order Number NAMA-13-F-0120
Contents
1 Overview
1.1 High-Level Architecture
1.2 NARA Catalog in Context
1.2.1 NARA Catalog Production
1.2.2 NARA Catalog Sandbox
1.3 Applicable Requirements
1.3.1 Sandbox Environment and Segregated Storage
1.3.2 Performance Requirements
1.3.3 Availability
1.3.4 Volume
1.3.5 Security Requirements
2 Hardware and Network Design
2.1 Production System
2.1.1 Assumptions
2.1.2 Server Hardware
2.1.3 NARA Catalog Storage Hardware
2.1.4 Network Hardware
2.2 Sandbox Environment
2.3 Development System
2.4 UAT System
2.4.1 UAT to PROD Procedure
2.5 Example 2014 and 2015 NARA Catalog Prod Computations
2.5.1 Example Server Requirements
2.5.2 Elastic Scalability
2.5.3 Unknowns
2.5.4 Computing Server Requirements for Index Entries of Varying Size
3 Operating System Design
3.1 Kernel Configuration
3.2 Memory Configuration
3.3 Accounts
3.4 Auditing
3.5 Ports Configuration
3.6 Clock Synchronization
3.7 SSH
3.8 Maintaining and Patching the Operating System
4 Storage Design
4.1 Storage Technology for NARA Catalog Prod
4.1.1 Version 1
4.1.2 Version 2
4.2 Structure
4.2.1 Project Directories
4.2.2 NAID Directories / Separate Environments
4.2.3 SFTP Server Access
5 Backups & Recovery
5.1 Backups
5.1.1 Backup Schedules
5.1.2 Backup Details
5.1.3 Backup Storage
5.1.4 Backup for NARA Catalog Storage
5.2 Recovery from Server Failure
5.2.1 Database Servers
5.2.2 Content Processing / Ingestion Servers
5.2.3 Search Engine Servers
5.2.4 Application Servers
5.2.5 Reporting, Monitoring & Admin Control
5.3 Recovery from Site Failure
6 System Monitoring
Version Control
Version | Date | Reviewer | Summary Description
1.0 | 2014-03-02 | Paul Nelson | Complete first version for NARA review
1.1 | 2014-03-16 | Paul Nelson | Incorporate changes from DCRF
1.2 | 2014-04-11 | Madhu Koneni | Adjusted the servers configuration based on what AWS provides
1.3 | 2014-05-21 | Paul Nelson | Updates from NARA SE Architecture review
1.4 | 2014-11-14 | Kristy Martin | Removed “Confidential to Search Technologies” text from the footer
1.5 | 2014-11-24 | Brandon Stahl | Replaced https://research.archives.gov url with https://catalog.archives.gov url
1.6 | 2015-07-08 | Brandon Stahl | Rebranded OPA as NARA Catalog
1 Overview
This document presents the system design, including hardware specifications, for the National Archives Catalog system currently being developed for the National Archives and Records Administration (NARA).
Specifically, this document will cover:
• Server requirements for NARA Catalog Production, including:
  o Server machines
  o Server specifications
• Disk space requirements for NARA Catalog Production, including:
  o Type of disk space
  o Size and I/O access requirements
• Networking requirements for NARA Catalog Production, including:
  o Network connectivity to the internet
  o Network connectivity to NARANet
  o Load-balancing / routing
• Requirements for other NARA Catalog Systems, including:
  o The sandbox environment
  o The developer environment
  o UAT environment
• Other system tools, including:
  o SFTP service for ingestion of digital objects
  o System monitoring
1.1 High-Level Architecture
The following diagram provides an overview of all NARA Catalog systems:
[Figure: High-level architecture of the NARA Catalog systems. Labeled components: Content Processing (multiple pipelines for multiple content processing data flows); Search Array; Annotations and Registration Database; DAS; OPA Storage; OPA Application Server with the Search API, Annotations API, Authentication API, Authorized Users API, and Access API; OPA User Interface (Twitter Bootstrap, Angular.js) in the Web Browser; Authorized Users Interface; sftp Content Submissions; Bulk-Exports server; Internet.]
The purpose of each system is as follows:
• Content Processing – The back-end system responsible for ingestion, for maintaining NARA Catalog storage, and for keeping the search engine indexes up to date.
• Search Array – The search engine itself, structured as a series of independent search nodes, each responsible for searching a portion of the entire index. Index portions are called “shards” and should each hold around 25-50 million records. Each search node has a redundant copy, both to increase query capacity and to provide failover.
• NARA Catalog Storage – The long-term content storage for all publicly available NARA data. NARA Catalog storage will contain a copy of NARA data so that it can be delivered quickly and efficiently to the public.
• Annotations and Registration Database – Contains registered user account information, tags, comments, transcriptions, and translations, as well as bookkeeping information for all annotations, such as lists of recently created or modified annotations, annotations per user, etc.
• Application Server – The system (likely made up of multiple servers) that handles all end-user and authorized-user requests.
• Client – Client software will be written in JavaScript and HTML5 and will run inside the user’s web browser.
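As a rough illustration of the Search Array sizing described above, the following Python sketch estimates how many shards and search nodes an index of a given size would need. The function names, the single redundant copy per node, and the sample total of 540 million records are assumptions made for illustration; they are not taken from the design, and actual server sizing depends on factors this sketch ignores.

```python
def shards_needed(total_records: int, records_per_shard: int) -> int:
    """Number of shards required to hold total_records (rounded up)."""
    if total_records < 0:
        raise ValueError("total_records must not be negative")
    if records_per_shard <= 0:
        raise ValueError("records_per_shard must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (total_records + records_per_shard - 1) // records_per_shard


def search_nodes_needed(shards: int, redundant_copies: int = 1) -> int:
    """Total search nodes when every shard also has redundant copies."""
    if shards < 0 or redundant_copies < 0:
        raise ValueError("shards and redundant_copies must not be negative")
    return shards * (1 + redundant_copies)


if __name__ == "__main__":
    total = 540_000_000  # hypothetical index size, for illustration only
    # The text suggests roughly 25-50 million records per shard.
    for per_shard in (25_000_000, 50_000_000):
        shards = shards_needed(total, per_shard)
        nodes = search_nodes_needed(shards)
        print(f"{per_shard:,} records/shard: {shards} shards, {nodes} nodes")
```

Smaller shards mean more nodes to operate but less data for each node to search; the 25-50 million range bounds that trade-off from both sides.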
1.2 NARA Catalog in Context
The following diagrams show how NARA Catalog fits “into context” with the other processes and systems in the National Archives.
There will be two contextual diagrams, one for NARA Catalog Production, and a second for the NARA Catalog Sandbox system. (Note: These diagrams are preliminary)
1.2.1 NARA Catalog Production
The following diagram shows how NARA Catalog Production fits into the rest of the NARA environment:
[Figure: NARA Catalog Production in context. Content sources: Description & Authority Service (DAS), Digital Processing Environment (DPE), Trusted Repository (future), and Content Owners providing Content Updates. OPA Production: Ingestion and Content Processing, Search Engine, OPA Storage, Annotations and Registration Database, Application Server. Consumers: NARA Moderator, NARA Authorized User, Non-professional Users, Researchers, NARA Research Support Services, Third Party API User.]
1.2.1.1 Content Providers
In the above diagram, systems which provide content (or are anticipated to someday provide content) are shown on the left, and consumers of NARA Catalog Production services are shown on the right.
NARA Catalog Production will receive updates from the NARA Description & Authority Service (DAS) and, through the Trusted Repository (TBD), from the Digital Processing Environment (DPE). Only trusted content with full digital provenance should be ingested into NARA Catalog.
Content updates will be provided by content owners as new files are scanned and/or content modifications are required. This can include storage of content (for example, from digitization partners) for future processing.
1.2.1.2 Consumers of NARA Catalog Services
Users of NARA Catalog fall into the following categories. The categories are not exclusive; a single person can hold several of these roles:
• Non-Professional Users – These are members of the public who are not professional or academic researchers. These users fall into various categories:
  o Occasional searchers – Log in to NARA Catalog from time to time to search, for example to find family members or fellow soldiers.
  o Contributors – These are users who help contribute to the archive with comments, tags, transcriptions, or translations.
• Researchers – Researchers are looking for specific source materials for specific research goals, for example to research a biography of a famous politician.
• Third Party API users – These are third party organizations that wish to interact programmatically with the NARA Catalog system, for example to bulk export images (The Digital Public Library of America, or Wikipedia) or to create new custom interfaces for searching NARA Catalog content.
• NARA Research Support Services – These are NARA employees who help researchers. It is expected they will be users of NARA Catalog to help the public find information.
• NARA Contribution Moderators – These are NARA employees who review contributions from the public. Content which is spam or vandalism will be removed (with a comment).
• NARA Authorized Users – These are NARA employees responsible for managing the user account database. They can deactivate and re-activate registered users and respond to support call requests (e.g. change my password, etc.).
1.2.2 NARA Catalog Sandbox
The purpose of NARA Catalog Sandbox is to “trial run” new content before it is posted on-line to the general public. In this capacity, it will have a different set of content providers and consumers, as shown below:
[Figure: NARA Catalog Sandbox in context. Content sources: Description & Authority Service (DAS), Digital Processing Environment (DPE), and Content Owners providing Content Updates. OPA Sandbox: Ingestion and Content Processing, Search Engine, OPA Sandbox Storage, Sandbox Application Server. Consumers: NARA Archivists, OPA Review Board, Content Owners.]
It is expected that DPE will provide content directly to the NARA Catalog sandbox, so the content can be tested in NARA Catalog before it is written to the trusted repository. Similarly, NARA Catalog sandbox will need to pull description data from DAS, as it would normally need to do for any sort of content ingestion.
NARA Catalog sandbox is not available for public consumption. Instead, the NARA Catalog sandbox application will be used only by:
• Content Owners – Who need to view and test their content in NARA Catalog Sandbox so they can verify its accuracy before it is moved on-line for the public.
• NARA Archivists – Who will need to view the content as well, to ensure that it meets archival standards.
• The NARA Catalog Review Board – An interdisciplinary group that verifies the quality of the content (and the data description files) as necessary before content can be officially moved to the public.
1.3 Applicable Requirements
The requirements which drive the system design are identified in the following tables, along with the section of the document to which each requirement is allocated.
1.3.1 Sandbox Environment and Segregated Storage
Requirement | Requirement Text | Section
2.3.1 | The NARA Catalog system shall provide a sandbox for a data producer to deposit records. | 2.2
2.3.1.1 | The sandbox shall allow indexing of the deposited records. | 2.2
2.3.1.2 | The sandbox shall allow searching of the deposited records by authorized users. | 2.2
2.3.2 | The sandbox shall index records that are not yet released for search by public users. | 2.2
2.3.3 | The NARA Catalog system shall exclude records from the search that are not yet released for public access. | 2.2
2.3.3.1 | The NARA Catalog system shall provide the capability for a System Administrator to set an embargo date on data that will not be available to the public. | 4.2
2.3.3.2 | The NARA Catalog system shall have a segregated storage space for digital objects that are not yet publicly available for search. | 4.2
2.10 | The NARA Catalog system shall provide a staging area to store SEIP packages that do not contain a description in DAS. | 4.2
2.11 | The NARA Catalog system shall provide a staging area to store digital objects that do not contain a description in DAS. | 4.2
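Requirements 2.3.3 and 2.3.3.1 together describe a visibility rule: a record is searchable by the public only once it has been released and any embargo date has passed. The Python sketch below shows one way such a rule could be expressed. The `Record` type, its field names, and the choice that a record becomes visible on the embargo date itself are hypothetical assumptions for illustration, not part of the NARA Catalog design.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal record used only for this illustration."""
    record_id: str
    released: bool                        # released for public access?
    embargo_until: Optional[date] = None  # None means no embargo is set


def is_publicly_searchable(record: Record, today: date) -> bool:
    """True if the record may appear in public search results today."""
    if not record.released:
        return False
    # Assumption: the record becomes visible on the embargo date itself.
    if record.embargo_until is not None and today < record.embargo_until:
        return False
    return True


def public_search_set(records: Iterable[Record], today: date) -> List[Record]:
    """Keep only the records that public users are allowed to search."""
    return [r for r in records if is_publicly_searchable(r, today)]
```

In this sketch a sandbox search would skip the filter for authorized users, while the public search path would always apply it.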
1.3.2 Performance Requirements
Requirement | Requirement Text | Section
10.1 | The NARA Catalog system shall have response times for returning a search result. | 2.1.2
10.1.1 | The NARA Catalog system response time for returning search results shall be less than 1 second for 90% of queries, not including network transfer to/from the browser. | 2.1.2
10.1.2 | The NARA Catalog system response time for returning search results shall be less than 2 seconds for 98% of queries, not including network transfer to/from the browser. | 2.1.2
10.1.3 | The NARA Catalog system response time for returning search results shall be less than 3 seconds for 99% of queries, not including network transfer to/from the browser. | 2.1.2
10.1.4 | The NARA Catalog system response time for returning search results shall be less than 5 seconds for 99.99% of queries, not including network transfer to/from the browser. | 2.1.2
10.2 | The NARA Catalog system shall have response times for navigating between screens. | 2.1.2
10.2.1 | The NARA Catalog system response times for navigating between screens when no search is involved shall be 99% within 1 second, not including network transfer to/from the browser. | 2.1.2
10.2.2 | The NARA Catalog system response times for navigating between screens when no search is involved shall be 99.99% within 2 seconds, not including network transfer to/from the browser. | 2.1.2
10.2.3 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 1 second for 90% of queries, not including network transfer to/from the browser. | 2.1.2
10.2.4 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 2 seconds for 98% of queries, not including network transfer to/from the browser. | 2.1.2
10.2.5 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 5 seconds for 99.99% of queries, not including network transfer to/from the browser. | 2.1.2
10.2.6 | The NARA Catalog system response times for navigating between screens that are not search results shall be a maximum of one (1) second. | 2.1.2
10.3 | The NARA Catalog system shall be capable of supporting at a minimum one (1) million user accounts. | 2.1.2.1
10.4 | The NARA Catalog system shall be able to provide sustained query performance of no less than sixty (60) queries per second for queries executed in sequence. | 2.1.2
10.4.1 | The NARA Catalog system shall provide a procedure for increasing query capacity (queries per second) as needed to handle expected capacity increases, with a maximum required lead time of two (2) weeks. | 2.5.2
10.4.2 | The NARA Catalog system shall provide a procedure for decreasing query capacity (queries per second) when increased query capacity is no longer required, but not less than the base query capacity provided at production launch. | 2.5.2
10.5 | <allocated to NARA Catalog Search Engine Design> |
10.6 | The NARA Catalog system shall support normal traffic, at a minimum, two-thousand (2,000) concurrent users. | 2.1.2
10.7 | The NARA Catalog system shall support surge traffic, at a minimum, twenty-thousand (20,000) concurrent users. | 2.1.2
10.8 | The NARA Catalog system shall be able to provide peak query performance of one-hundred (100) queries per second, for queries executed in sequence. | 2.1.2
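The tiered search response-time targets in requirements 10.1.1 through 10.1.4 can be checked against a set of measured latencies. The Python sketch below is one possible way to do so; the names `SEARCH_TARGETS`, `fraction_under`, and `check_search_targets` are hypothetical, and the latencies are assumed to be server-side times in seconds that already exclude network transfer to and from the browser.

```python
from typing import Dict, Sequence, Tuple

# (minimum fraction of queries, limit in seconds), from 10.1.1 to 10.1.4.
SEARCH_TARGETS: Tuple[Tuple[float, float], ...] = (
    (0.90, 1.0),
    (0.98, 2.0),
    (0.99, 3.0),
    (0.9999, 5.0),
)


def fraction_under(latencies: Sequence[float], limit_seconds: float) -> float:
    """Fraction of measured latencies strictly below limit_seconds."""
    if not latencies:
        raise ValueError("at least one latency sample is required")
    under = sum(1 for t in latencies if t < limit_seconds)
    return under / len(latencies)


def check_search_targets(
    latencies: Sequence[float],
) -> Dict[Tuple[float, float], bool]:
    """Map each (fraction, limit) target to whether the samples meet it."""
    return {
        (fraction, limit): fraction_under(latencies, limit) >= fraction
        for fraction, limit in SEARCH_TARGETS
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    samples = [0.2] * 95 + [1.5] * 4 + [2.5]
    for (fraction, limit), ok in check_search_targets(samples).items():
        status = "met" if ok else "missed"
        print(f"{fraction:.2%} of queries under {limit:g}s: {status}")
```

Note that verifying the 99.99% tier meaningfully requires well over ten thousand samples, since a single slow query in a smaller sample already exceeds the allowed fraction.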
1.3.3 Availability
Requirement | Requirement Text | Section
11.1 | The NARA Catalog system shall be 99.5% available for search and other processing 24 hours a day/7 days a week. | 5.2
11.2 | The NARA Catalog system shall be able to recover system functionality after a failure. | 5.2
11.2.1 | The NARA Catalog system shall be able to recover search functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.2 | The NARA Catalog system shall be able to recover search functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.3 | The NARA Catalog system shall be able to recover tags from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.4 | The NARA Catalog system shall be able to recover tags from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.5 | The NARA Catalog system shall be able to recover comments from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.6 | The NARA Catalog system shall be able to recover comments from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.7 | The NARA Catalog system shall be able to recover translations from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.8 | The NARA Catalog system shall be able to recover translations from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.9 | The NARA Catalog system shall be able to recover transcriptions from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.10 | The NARA Catalog system shall be able to recover transcriptions from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.11 | The NARA Catalog system shall be able to recover login functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.12 | The NARA Catalog system shall be able to recover login functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.13 | The NARA Catalog system shall be able to recover API functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.14 | The NARA Catalog system shall be able to recover API functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.15 | The NARA Catalog system shall be able to recover the ingest functionality from any hardware failure within 2 days. | 5.2
11.2.16 | The NARA Catalog system shall be able to recover the ingest functionality from any software failure within 2 days. | 5.2
11.2.17 | The NARA Catalog system shall be able to recover the reporting functionality from any hardware failure within 2 days. | 5.2
11.2.18 | The NARA Catalog system shall be able to recover the reporting functionality from any software failure within 2 days. | 5.2
11.2.19 | The NARA Catalog system shall be able to recover authorized user interfaces from any hardware failure within 2 days. | 5.2
11.2.20 | The NARA Catalog system shall be able to recover authorized user interface functionality from any software failure within 2 days. | 5.2
11.2.21 | The NARA Catalog system shall be able to recover the information exchange functionality from any hardware failure within 2 days. | 5.2
11.2.22 | The NARA Catalog system shall be able to recover the information exchange functionality from any software failure within 2 days. | 5.2
11.2.23 | The NARA Catalog system shall be able to recover all functionality from a site-wide system failure within 7 days, with no more than 48 hours of data loss. | 5.3
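To put the 99.5% availability target of requirement 11.1 in concrete terms, the Python sketch below converts an availability fraction into a downtime budget and relates that budget to the 90-minute recovery window. The function names are hypothetical, and the calculation assumes availability is measured around the clock over the stated period and that every outage consumes the full recovery window; the requirements do not state how availability is to be measured.

```python
HOURS_PER_DAY = 24


def allowed_downtime_hours(availability: float, period_days: float) -> float:
    """Hours of downtime permitted in a period at the given availability.

    availability is a fraction (0.995 for 99.5%), measured 24 hours a day.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if period_days < 0:
        raise ValueError("period_days must not be negative")
    return (1.0 - availability) * period_days * HOURS_PER_DAY


def outages_within_budget(budget_hours: float, recovery_minutes: float) -> int:
    """Whole number of worst-case outages that fit in the downtime budget."""
    if recovery_minutes <= 0:
        raise ValueError("recovery_minutes must be positive")
    return int(budget_hours * 60 // recovery_minutes)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(0.995, 365)
    monthly = allowed_downtime_hours(0.995, 30)
    print(f"Downtime budget per 365 days: {yearly:.1f} hours")
    print(f"Downtime budget per 30 days:  {monthly:.1f} hours")
    print("Full 90-minute recoveries per year:",
          outages_within_budget(yearly, 90))
```

Under these assumptions the yearly budget is about 43.8 hours, so the two-day and seven-day recovery windows for ingest, reporting, and site-wide failure can only be compatible with 11.1 if search and other public-facing functions stay up while those recoveries run.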
1.3.4 Volume
Requirement | Requirement Text | Section
12.1 | The NARA Catalog system configuration shall provide the capability to scale on demand. | 2.5.2
12.1.1 | The NARA Catalog system architecture shall be capable of supporting a minimum of 10,000 terabytes of NARA Catalog source data, and scalable up to 57,000 terabytes of NARA Catalog source data. | 2.1.3
12.1.2 | <Allocated to Search Engine Design> | 4
12.1.2.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 500 million digital objects. | 2.1.3, 2.1.2.3
12.1.3 | <Allocated to Search Engine Design> |
12.1.3.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 20 million archival description records. | 2.1.2.3
12.1.4 | The NARA Catalog system architecture shall be capable of supporting a minimum of 20 million authority records. | 2.1.2.3
12.1.4.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 10 million authority records. | 2.1.2.3
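As a back-of-envelope reading of the volume figures above, the Python sketch below derives two numbers: the growth factor between the minimum and maximum source-data volumes in 12.1.1, and the mean size per digital object if the 10,000-terabyte minimum were spread evenly over the 500 million objects of 12.1.2.1. Pairing those two minimums is an assumption made purely for illustration, as are the function names and the use of decimal terabytes (10^12 bytes); the requirements do not tie the two figures together.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes
BYTES_PER_MB = 10**6   # assumption: decimal megabytes


def growth_factor(minimum_tb: float, maximum_tb: float) -> float:
    """How many times larger the maximum volume is than the minimum."""
    if minimum_tb <= 0:
        raise ValueError("minimum_tb must be positive")
    return maximum_tb / minimum_tb


def mean_object_size_mb(source_tb: int, object_count: int) -> float:
    """Mean size per object, in MB, if source_tb is spread evenly."""
    if object_count <= 0:
        raise ValueError("object_count must be positive")
    return source_tb * BYTES_PER_TB / object_count / BYTES_PER_MB


if __name__ == "__main__":
    print(f"Growth factor: {growth_factor(10_000, 57_000):.1f}x")
    size = mean_object_size_mb(10_000, 500_000_000)
    print(f"Mean object size at the minimums: {size:.0f} MB")
```

Under these assumptions the architecture must tolerate growth of 5.7 times in stored source data, and the minimums imply a mean of 20 MB per digital object.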
1.3.5 Security Requirements
The following security requirements are allocated to the system design.
Requirement | Requirement Text | Section
13.1 | The NARA Catalog system shall be implemented in compliance with NARA security guidance as provided by NARA in the NARA Catalog and Cloud Service Provider Baseline Security Controls. | 3
13.1.1 | The NARA Catalog system shall be delivered with any guest accounts disabled for COTS products installed on the system. (1.1 Access Control, AC-2) | 3.3
13.1.2 | The NARA Catalog system shall automatically terminate temporary and emergency accounts after a period not to exceed 15 days for unclassified information systems. (1.1 Access Control, AC-2 (2)) | 3.3
13.1.3 | The NARA Catalog system shall automatically disable inactive accounts after [a period not to exceed 365 days]. (1.1 Access Control, AC-2 (3)) | 3.3
13.1.4 | The NARA Catalog system shall automatically audit account creation, modification, disabling, and termination actions and notify, as required, appropriate individuals. (1.1 Access Control, AC-2 (4)) | 3.4
13.1.5 | The NARA Catalog system shall isolate the programs and data areas of users from other users and the system itself. (1.1 Access Control, AC-3) | 3, 4.2
13.1.5.1 | The NARA Catalog system shall provide for the capability to enforce role-based access control policies. (1.1 Access Control, AC-3) | 3.3
13.1.6 | The NARA Catalog system shall enforce approved authorizations for controlling the flow of information within the system and between interconnected systems in accordance with applicable policy. (1.1 Access Control, AC-4) | 3.5, 2.1.4
13.1.7 | The NARA Catalog system shall provide the capability to enforce the concept of least privilege, allowing only authorized accesses for users (and processes acting on behalf of users) which are necessary to accomplish assigned tasks in accordance with NARA missions and business functions. (1.1 Access Control, AC-6) | 4.2, 3.3
13.1.7.1 | The NARA Catalog system shall be able to enforce restrictions for access to security-related functions. (1.1 Access Control, AC-6 (1)) Examples of security functions include but are not limited to: establishing system accounts, configuring access authorizations (i.e., permissions, privileges), setting events to be audited, system programming, system and security administration, and other privileged functions. | 3.3
13.1.8 | The NARA Catalog system shall enforce a limit of [a maximum of 5] consecutive invalid login attempts by a user during a [15 minute period]. (1.1 Access Control, AC-7a) | 3.3
13.1.8.1 | The NARA Catalog system shall automatically [lock the account/node for at least 15 minutes] when the maximum number of unsuccessful attempts is exceeded. (1.1 Access Control, AC-7b) | 3.3
13.1.9 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that provides privacy and security notices consistent with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance. (1.1 Access Control, AC-8a) | 3.3
13.1.9.1 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states users are accessing a U.S. Government information system. (1.1 Access Control, AC-8a) | 3.3
13.1.9.2 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states system usage may be monitored, recorded, and subject to audit. (1.1 Access Control, AC-8a) | 3.3
13.1.9.3 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states unauthorized use of the system is prohibited and subject to criminal and civil penalties. (1.1 Access Control, AC-8a) | 3.3
13.1.9.4 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states use of the system indicates consent to monitoring and recording. (1.1 Access Control, AC-8a) | 3.3
13.1.9.5 | The NARA Catalog system shall retain the notification message or banner on the screen until users take explicit actions to log on to or further access the information system. (1.1 Access Control, AC-8b) | 3.3
13.1.9.6 | The NARA Catalog system shall display the system use information when appropriate, before granting further access. (1.1 Access Control, AC-8c) | 3.3
13.1.9.6.1 | The NARA Catalog system shall display references, if any, to monitoring, recording, or auditing that are consistent with privacy accommodations for such systems that generally prohibit those activities; and include in the notice given to public users of the information system a description of the authorized uses of the system. (1.1 Access Control, AC-8c) | 3.3
13.1.10 | The NARA Catalog system shall be capable of auditing successful and unsuccessful account logon events, account management events, object access, policy change, privilege functions, process tracking, and system events. (1.3 Audit and Accountability, AU-2) | 3.4
13.1.11 | The NARA Catalog system shall be capable of auditing all administrator activity, authentication checks, authorization checks, data deletions, data access, data changes, and permission changes. (1.3 Audit and Accountability, AU-2) | 3.4
13.1.12 | The NARA Catalog system shall produce audit records that contain sufficient information to, at a minimum, establish what type of event occurred, when (date and time) the event occurred, where the event occurred, the source of the event, the outcome (success or failure) of the event, and the identity of any user/subject associated with the event. (1.3 Audit and Accountability, AU-3) | 3.4
13.1.12.1 The NARA Catalog system shall produce audit records for data requiring moderate or high integrity, the information system shall include the date and time of the event; the component of the information system (e.g., software component, hardware component) where the event occurred; type of event; subject identity; and the outcome (success or failure) of the event.] in the audit records for audit events identified by type, location, or subject. (1.3 Audit and Accountability, AU-3 (1))
3.4
13.1.13 The NARA Catalog system shall allocate audit record storage capacity and configure auditing to reduce the likelihood of such capacity being exceeded. (1.3 Audit and Accountability, AU-4)
3.4
13.1.14 The NARA Catalog system shall alert designated NARA officials in the event of an audit processing failure. (1.3 Audit and Accountability, AU-5a)
3.4, 6
13.1.14.1 The NARA Catalog system shall overwrite the oldest audit records after an audit processing failure, for low or moderate integrity information systems. (1.3 Audit and Accountability, AU-5b)
3.4
13.1.15 The NARA Catalog system shall provide an audit reduction and report generation capability. (1.3 Audit and Accountability, AU-7)
3.4
13.1.16 The NARA Catalog system shall provide the capability to automatically process audit records for events of interest based on selectable event criteria. (1.3 Audit and Accountability, AU-7(1))
3.4
13.1.17 The NARA Catalog system shall use internal system clocks to generate time stamps for audit records. (1.3 Audit and Accountability, AU-8)
3.6
13.1.18 The NARA Catalog system shall synchronize internal information system clocks [or at least every 24 hours] with [NARA’s authoritative time source]. (1.3 Audit and Accountability, AU-8 (1))
3.6
13.1.19 The NARA Catalog system shall protect audit information and audit tools from unauthorized access, modification, and deletion. (1.3 Audit and Accountability, AU-9)
3.4
13.1.19.1 The NARA Catalog system shall provide the capability to log actual and attempted machine access to the audit log. (1.3 Audit and Accountability, AU-9)
3.4
13.1.20 The NARA Catalog system shall provide audit record generation capability for the list of auditable events defined in AU-2. (1.3 Audit and Accountability, AU-12a)
3.4
13.1.20.1 The NARA Catalog system shall allow designated NARA personnel to select which auditable events are to be audited by specific components of the system. (1.3 Audit and Accountability, AU-12b)
3.4
13.1.20.2 The NARA Catalog system shall generate audit records for the list of audited events defined in AU-2 with the content as defined in AU-3. (1.3 Audit and Accountability, AU-12c)
3.4
13.1.20.3 The NARA Catalog system shall capture error logs from COTS products. (1.3 Audit and Accountability, AU-12)
2.1.2.4
13.1.20.4 The NARA Catalog system shall capture Operating System errors. (1.3 Audit and Accountability, AU-12)
3.4
13.1.20.5 <Allocated to NARA Catalog Application Server Design>
13.1.20.6 The NARA Catalog system shall co-locate COTS error logs from different locations to a common storage location. (1.3 Audit and Accountability, AU-12)
2.1.2.4
13.1.20.7 <Allocated to NARA Catalog Ingestion Design, NARA Catalog Application Server Design, NARA Catalog Search Engine Design>
13.1.20.8 The NARA Catalog system shall provide error detection when accessing memory via parity and/or hardware register checking, as available by the cloud environment selected by the government for hosting the NARA Catalog servers. (1.3 Audit and Accountability, AU-12)
3.2
13.1.21 The NARA Catalog system shall implement configuration settings for information technology products employed within the information system using [Security Architecture security configuration checklists approved and published by NARA IT Security Staff (NHI)] that reflect the most restrictive mode consistent with operational requirements. (1.5 Configuration Management, CM-6)
3
13.1.21.1 <Allocated to NARA Catalog Application Server Design for MySQL and JBoss Configuration>
13.1.22 The NARA Catalog system shall use the Center for Internet Security guidelines (Level 1) to disable ports, protocols, and/or services identified in the configuration guides. (1.5 Configuration Management, CM-7)
3.5
13.1.23 The NARA Catalog system shall provide the capability for backup of the system. (1.6 Contingency Planning, CP-9)
5.1
13.1.24 The NARA Catalog system shall provide the capability for the backup of COTS product files as required to restore operational capability. (1.6 Contingency Planning, CP-9)
5.1
13.1.25 The NARA Catalog system shall provide the capability for the backup of application files as required to restore operational capability. (1.6 Contingency Planning, CP-9)
5.1
13.1.26 The NARA Catalog system shall provide the capability for the backup of configuration support files as required to restore operational capability. (1.6 Contingency Planning, CP-9)
5.1
13.1.27 The NARA Catalog system shall provide the capability for the backup of the files listed in the NARA Catalog Administration Guide, Section 5. (1.6 Contingency Planning, CP-9)
5.1
13.1.28 The NARA Catalog system shall provide the capability to cancel a scheduled backup process, subject to permissions. (1.6 Contingency Planning, CP-9)
5.1
13.1.29 The NARA Catalog system shall provide the capability to cancel a manual backup process, subject to permissions. (1.6 Contingency Planning, CP-9)
5.1
13.1.30 The NARA Catalog system shall provide the capability to recover the system. (1.6 Contingency Planning, CP-10)
5.2, 5.3
13.1.31 The NARA Catalog system shall provide the capability to recover COTS product files to restore operational capability. (1.6 Contingency Planning, CP-10)
5.2, 5.3
13.1.32 The NARA Catalog system shall provide the capability to recover application files to restore operational capability. (1.6 Contingency Planning, CP-10)
5.2, 5.3
13.1.33 The NARA Catalog system shall provide the capability to recover configuration support files to restore operational capability. (1.6 Contingency Planning, CP-10)
5.2, 5.3
13.1.34 The NARA Catalog system shall provide the capability to recover from a hardware failure. (1.6 Contingency Planning, CP-10)
5.2, 5.3
13.1.35 The NARA Catalog system shall provide the capability to recover from a physical site outage. (1.6 Contingency Planning, CP-10)
5.2, 5.3
13.1.36 The NARA Catalog system shall uniquely identify and authenticate users (or processes acting on behalf of users). (1.7 Identification & Authentication, IA-2)
3.3
13.1.37 The NARA Catalog system shall protect authenticator content from unauthorized disclosure and modification. (1.7 Identification & Authentication, IA-5)
3.3
13.1.38 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce minimum password complexity of [a case sensitive, 8-character mix of upper case letters, lower case letters, numbers, and special characters, including at least one of each]. (1.7 Identification & Authentication, IA-5(1))
3.3
13.1.38.1 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce at least a [four character change] when new passwords are created. (1.7 Identification & Authentication, IA-5(1))
3.3
13.1.38.2 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, encrypt passwords in storage and in transmission. (1.7 Identification & Authentication, IA-5(1))
3.3
13.1.38.3 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce password minimum and maximum lifetime restrictions of [1 day minimum, 90 day maximum], and prohibit password reuse for [a minimum of 5 for unclassified information systems] generations. (1.7 Identification & Authentication, IA-5(1))
3.3
13.1.38.4 <Not allocated to system design; pertains to public users only>
13.1.39 The NARA Catalog system shall obscure feedback of authentication information during the authentication process to protect the information from possible exploitation/use by unauthorized individuals. (1.7 Identification & Authentication, IA-6)
3.3
13.1.40 The NARA Catalog system shall use mechanisms for authentication to a cryptographic module that meet the requirements of applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance for such authentication. (1.7 Identification & Authentication, IA-7) NARA Guidance: This requirement means that cryptographic modules used for identification and authentication must meet FIPS 140-2 standards.
3.3
13.1.41 The NARA Catalog system shall uniquely identify and authenticate non-NARA users (or processes acting on behalf of non-NARA users). (1.7 Identification & Authentication, IA-8)
3.3
13.1.42 The NARA Catalog system shall separate user functionality (including user interface services) from information system management functionality. (System and Communications Protection, SC-2) Supplemental Guidance: Information system management functionality includes, for example, functions necessary to administer databases, network components, workstations, or servers, and typically requires privileged user access. The separation of user functionality from information system management functionality is either physical or logical and is accomplished by using different computers, different central processing units, different instances of the operating system, different network addresses, combinations of these methods, or other methods as appropriate. An example of this type of separation is observed in web administrative interfaces that use separate authentication methods for users of any other information system resources. This may include isolating the administrative interface on a different domain and with additional access controls.
3.3
13.1.43 The NARA Catalog system shall prevent unauthorized and unintended information transfer via shared system resources. (System and Communications Protection, SC-4) Supplemental Guidance: The purpose of this control is to prevent information, including encrypted representations of information, produced by the actions of a prior user/role (or the actions of a process acting on behalf of a prior user/role) from being available to any current user/role (or current process) that obtains access to a shared system resource (e.g., registers, main memory, secondary storage) after that resource has been released back to the information system. Control of information in shared resources is also referred to as object reuse. This control does not address: (i) information remanence which refers to residual representation of data that has been in some way nominally erased or removed; (ii) covert channels where shared resources are manipulated to achieve a violation of information flow restrictions; or (iii) components in the information system for which there is only a single user/role.
3.3
13.1.44 The NARA Catalog system shall monitor and control communications at the external boundary of the system and at key internal boundaries within the system. (System and Communications Protection, SC-7a)
2.1.4
13.1.44.1 The NARA Catalog system shall connect to external networks or information systems only through managed interfaces consisting of boundary protection devices arranged in accordance with the NARA security architecture. (System and Communications Protection, SC-7b)
2.1.4
13.1.44.2 The NARA Catalog system shall configure external firewalls to permit only the minimum protocols through that are required for the system to function. (System and Communications Protection, SC-7)
2.1.4
13.1.44.3 The NARA Catalog system shall configure external firewalls to ignore external ICMP 'echo' requests to the system. (System and Communications Protection, SC-7)
2.1.4
13.1.44.4 The NARA Catalog system shall configure external firewalls to ignore external UDP 'chargen' requests to the system. (System and Communications Protection, SC-7)
2.1.4
13.1.45 The NARA Catalog system shall protect the integrity of transmitted information. (System and Communications Protection, SC-8)
3.7, 4.2.3
13.1.46 The NARA Catalog system shall employ cryptographic mechanisms to recognize changes to information during transmission. (System and Communications Protection, SC-8 (1))
3.7, 4.2.3
13.1.47 The NARA Catalog system shall protect the confidentiality of transmitted information. (System and Communications Protection, SC-9)
3.7, 4.2.3
13.1.48 The NARA Catalog system shall employ cryptographic mechanisms to prevent unauthorized disclosure of information during transmission. (System and Communications Protection, SC-9(1))
3.7, 4.2.3
13.1.49 The NARA Catalog system shall terminate the network connection associated with a communications session at the end of the session or after no more than 30 minutes of inactivity for a backend user. (System and Communications Protection, SC-10) SC-10 Guidance: Long running batch jobs and other necessary operations are not subject to this time limit.
2.1.4
13.1.50 <Not allocated to system design; pertains to public users only>
13.1.51 The NARA Catalog system shall implement required cryptographic protections using cryptographic modules that comply with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance. (System and Communications Protection, SC-13) NARA Guidance: This requirement means that any cryptographic modules used must meet FIPS 140-2 standards.
3.7, 4.2.3
13.1.52 The NARA Catalog system shall protect the integrity and availability of publicly available information and applications. (System and Communications Protection, SC-14)
4
13.1.53 The NARA Catalog system shall prohibit remote activation of collaborative computing devices (if collaborative computing mechanisms are used). (System and Communications Protection, SC-15)
3
13.1.54 The NARA Catalog system shall provide mechanisms to protect the authenticity of communications sessions. (System and Communications Protection, SC-23)
3.7, 4.2.3
13.1.55 The NARA Catalog system shall protect the confidentiality and integrity of information at rest. (System and Communications Protection, SC-28)
4, 5.1
13.1.56 <Allocated to NARA Catalog Application Server Design>
13.1.57 <Allocated to NARA Catalog Application Server Design>
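As an illustration of the password complexity rule in requirement 13.1.38, the following is a minimal sketch of a complexity check. The function name `meets_complexity` is hypothetical, and treating "special characters" as Python's `string.punctuation` set is an assumption; the requirement itself does not define which characters count as special.

```python
import string


def meets_complexity(password: str, min_length: int = 8) -> bool:
    """Check a password against the rule in requirement 13.1.38.

    Assumption: "special characters" means the ASCII punctuation set.
    The comparison is case sensitive because upper and lower case
    letters are counted separately.
    """
    return (
        len(password) >= min_length
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        and any(c in string.punctuation for c in password)
    )


if __name__ == "__main__":
    for candidate in ("Abcdef1!", "abcdef1!", "Abc1!", "Abcdefg1"):
        print(candidate, meets_complexity(candidate))
```

The other password rules (four character change, lifetime limits, and reuse history) depend on stored state and are not covered by this sketch.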
2 Hardware and Network Design
This section covers the anticipated hardware and network required to meet the initial system requirements for NARA Catalog production.
2.1 Production System
2.1.1 Assumptions
This production system is scaled to meet the following stated requirements:
500 million digital objects
30 million records (20 million descriptions, 10 million authorities)
2000 sustained concurrent users, 20,000 peak
Further, we assume that every digital object is a separate digital object file as specified in the current system with an <object> tag in the archival description.
See section 2.5 for an example of how to compute 2014 and 2015 NARA Catalog Prod requirements for smaller volumes or a mixture of different types of index entries.
2.1.2 Server Hardware
The following diagram shows all of the hardware servers and networks anticipated for the NARA Catalog Production system.
[Diagram: OPA Production system. Components and server counts: Search Engine Array, 25 + 25 (failover & QPS); Content Processing & FTP, 2 (capacity); Database, 1 + 1 (failover); Application Servers, 3 (capacity) + 1 (failover) + 1 (bulk exports); Reporting, Monitoring & Admin Control, 1 + 1 (failover); OPA Storage. External connections: Internet, DAS, FTP Client.]
All standard server machines are expected to be modern machines with the minimum characteristics:
2.5mb processor cache per core
3 GHz CPU clock speed or better
SAS (Serial Attached SCSI) Hard Drives
o Note: SATA drives will not provide sufficient I/O bandwidth capacity for NARA Catalog applications.
o SAN storage is also a viable option, as long as IO operations/second are sufficiently capable
RAID 1, RAID 5, or RAID 10 for all hard drives
Disks for servers must not be shared
o The architecture is designed to be a “share nothing” system
o The only shared storage is NARA Catalog Storage
o All hard disk drives on each machine must be dedicated spindles.
o This is especially critical for servers listed with IOPS of “high” below
Specific requirements on RAM and number of processing cores per server are identified below:
System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
database | primary | 1 | 122gb | 16 | 250gb | high | MySQL Server
database | failover | 1 | 122gb | 12 | 250gb | high | MySQL Server
Content Processing | primary | 2 | 30gb | 8 | 2tb | med | Content processing & SFTP server. Note that two servers provide capacity. Failure of one server will reduce ingestion capacity.
Search Engine | primary | 25 | 60.5gb | 16 | 1tb | high | Solr Search. Search servers for 530 million medium-sized index entries divided 25 ways.
Search Engine | failover | 25 | 60.5gb | 16 | 1tb | high | Failover row, holds index replicas for primary row.
Web Application | primary | 4 | 30gb | 16 | 1tb | low | Holds application servers to handle API requests from end-user interfaces. 4 servers are recommended for load balancing and failover. Disk space is for holding 1yr of log data.
Bulk Export | primary | 1 | 30gb | 8 | 100gb | low | Server to process bulk exports in background. Output is written to NARA Catalog Storage.
Reporting, Monitoring, Server Control | primary | 1 | 30gb | 16 | 2tb | low | Holds the reporting application, Zookeeper server management, and system monitoring tools. May hold SFTP server as well. Disk space is for holding 2yrs of log data for reporting functions.
Reporting, Monitoring, Server Control | failover | 1 | 30gb | 16 | 2tb | low | Failover server for admin functions. Disk space is for holding 2yrs of log data for reporting functions.
Note:
“Cnt” is the number of servers for the specified configuration
RAM, Cores, and Hard Disk are “per server” values.
Database recommendations come from this sizing guide.
Hard Disk Drive Provisioning
The hard disk numbers above identify different IOPS (I/O Operations Per Second):
high – 1000-2000 IOPS
medium – 250-500 IOPS
low – 100-250 IOPS
2.1.2.1 Disk Space for Database Servers
The following spreadsheet provides a very rough estimate of the disk space required for the users and annotations table.
Notes:
Requirements estimate 1,000,000 users. Estimates for number of transcriptions, translations, tags, etc, are based on this estimate.
Number of bytes of data per row and for indexes are estimated based on current table designs
A multiplier of 4x is provided for expansion in the MySQL InnoDB database structure.
Table | Data (bytes per row) | Indexes (bytes per row) | Count | Total (gb)
transcriptions | 443 | 100 | 1,000,000 | 2.02
translations | 443 | 100 | 500,000 | 1.01
tags | 148 | 50 | 10,000,000 | 7.36
comments | 328 | 50 | 2,000,000 | 2.81
annotations log | 563 | 200 | 15,600,000 | 44.25
accounts | 274 | 100 | 1,000,000 | 1.39
Total | | | | 58.85
Since these estimates are very rough, a total disk space of 250gb per server is recommended to provide a buffer of roughly 4x for growth in requirements or miscalculations.
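The per-table estimate above can be sketched as follows. This is a minimal illustration, assuming each total is (data bytes + index bytes) x row count x the 4x expansion multiplier, and that "gb" means 2^30 bytes; the formula is inferred from the notes rather than stated in the spreadsheet, so computed values may differ slightly from the rounded figures in the table.

```python
GIB = 1024 ** 3
INNODB_EXPANSION = 4  # the 4x multiplier from the notes above

# (data bytes per row, index bytes per row, row count), copied from the table
TABLES = {
    "transcriptions": (443, 100, 1_000_000),
    "translations": (443, 100, 500_000),
    "tags": (148, 50, 10_000_000),
    "comments": (328, 50, 2_000_000),
    "annotations log": (563, 200, 15_600_000),
    "accounts": (274, 100, 1_000_000),
}


def table_size_gib(data_bytes, index_bytes, rows, expansion=INNODB_EXPANSION):
    """Estimated on-disk size of one table in GiB (assumed formula)."""
    return (data_bytes + index_bytes) * rows * expansion / GIB


sizes = {name: table_size_gib(*spec) for name, spec in TABLES.items()}
total_gib = sum(sizes.values())

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name:16s} {size:7.2f}")
    print(f"{'Total':16s} {total_gib:7.2f}")
```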
2.1.2.2 Disk Space for Content Processing Servers
The content processing servers will maintain local cache files for the following:
All original DAS XML files
o Total number of DAS XML files expected (initial configuration): 20 million (descriptions) + 10 million (authority records) = 30 million total
o Average DAS XML size: ~10kb
o Total required disk space: 10,000 * 30,000,000 = 300gb
A copy of all ARC XML files
o Total number of ARC XML files expected (initial configuration): 20 million (descriptions) + 10 million (authority records) = 30 million total
o Average ARC XML size: ~10kb
o Total required disk space: 10,000 * 30,000,000 = 300gb
Database cache of parent records and counts
o Parent records: 20,000,000 * 0.25 = 5 million
About 25% of DAS descriptions are a parent record
o Authority Records: 10 million
o Total records in the database cache: 10 million + 5 million = 15 million
o Expected bytes per record: 1000
o Total disk space required: 15,000,000 * 1,000 = 15gb
Total disk space estimated: 300gb + 300gb + 15gb = 615gb
Recommended disk space: 2tb to account for expansion of requirements and unexpected growth
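The content processing cache estimate above can be reproduced with a short calculation. This is a sketch only; the variable names are illustrative, and "gb" is taken to mean 10^9 bytes, which is what the arithmetic in the list implies.

```python
GB = 10 ** 9

descriptions = 20_000_000
authorities = 10_000_000
avg_xml_bytes = 10_000           # ~10kb per DAS or ARC XML file
bytes_per_cache_record = 1_000   # expected bytes per database cache record

xml_files = descriptions + authorities
das_xml_bytes = xml_files * avg_xml_bytes
arc_xml_bytes = xml_files * avg_xml_bytes

parent_records = descriptions // 4   # about 25% of descriptions are parents
cache_records = parent_records + authorities
db_cache_bytes = cache_records * bytes_per_cache_record

total_gb = (das_xml_bytes + arc_xml_bytes + db_cache_bytes) / GB

if __name__ == "__main__":
    print(f"DAS XML:  {das_xml_bytes / GB:.0f} gb")
    print(f"ARC XML:  {arc_xml_bytes / GB:.0f} gb")
    print(f"DB cache: {db_cache_bytes / GB:.0f} gb")
    print(f"Total:    {total_gb:.0f} gb")
```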
2.1.2.3 Search Engine Sizing
Size per Index Entry
Each index entry is relatively verbose:
Entire ARC XML description = 10K bytes / entry
Technical metadata for each digital object = 1K bytes / entry
Extracted text content = 4K bytes / entry
o Note that even though most entries are images (with no extracted text content), PDF files are typically provided which contain OCR text for all of the images.
o Therefore, the average of 4K bytes / entry holds
Additional metadata fields: 2K bytes / entry
Total size: 17K bytes / entry
Note: These index sizes are substantially larger than the current OPA Pilot system because the full XML description and full object metadata XML is indexed with every index-entry, as is required to handle the API use cases discussed with NARA.
Index Entries / Node
The general consensus for search engine index sizing is:
Small Documents (1K-5K / entry): 50 million index entries / node
Medium Documents (5K-50K / entry): 25 million index entries / node
Large Documents (50K+ / entry): 10 million index entries / node
Therefore, Search Technologies recommends sizing each machine at around 25 million index entries on each node.
Total number of index entries
Total number of records to be indexed into NARA Catalog:
20 million archival descriptions
10 million authority records
500 million digital objects
Total: 530 million index entries
Total Number of Servers
Based on the above estimates, the total number of servers recommended will be:
530,000,000 / 25,000,000 = 21.2 servers (22 at minimum)
Rounding up for headroom = 25 servers
Two replicas for query performance and scalability
Total servers: 25 * 2 = 50 servers
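The server count above can be sketched as a small function. This is an illustration under stated assumptions: the function and parameter names are hypothetical, and the step from the computed minimum to 25 partitions is treated as a provisioning choice supplied by the caller, since the text does not state a rounding rule.

```python
import math


def search_servers(total_entries, entries_per_node, rows=2,
                   provisioned_partitions=None):
    """Return (minimum partitions, total servers).

    `rows` is the number of index copies (primary row plus failover row).
    `provisioned_partitions` lets the caller round up beyond the minimum.
    """
    minimum = math.ceil(total_entries / entries_per_node)
    partitions = minimum if provisioned_partitions is None else provisioned_partitions
    if partitions < minimum:
        raise ValueError("provisioned partitions cannot hold the index")
    return minimum, partitions * rows


minimum_partitions, total_servers = search_servers(
    530_000_000, 25_000_000, rows=2, provisioned_partitions=25
)

if __name__ == "__main__":
    print(minimum_partitions, total_servers)
```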
Index Space
The storage required for each search engine server is computed as follows:
Total data content = 530 million entries * 17K bytes / entry = 9.01tb
Index content = 9.01tb (same as content size)
Total disk required: 18.02tb
Round up: 20tb
Disk per server: 20tb / 25 servers = 800gb / server
Recommended disk space per server: 1tb / server
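The per-server index disk figure can be checked with the sketch below. It assumes 1K = 1,000 bytes and 1tb = 10^12 bytes, and it models the round-up as "next multiple of 5tb", which is a hypothetical rule chosen to match the 20tb figure; the text does not say how the rounding was done.

```python
import math


def index_disk_per_server(entries, bytes_per_entry, servers,
                          index_factor=2, round_to_tb=5):
    """Return (raw tb, provisioned tb, gb per server).

    `index_factor=2` reflects content plus an index of the same size.
    `round_to_tb` is an assumed rounding granularity.
    """
    raw_tb = entries * bytes_per_entry * index_factor / 1e12
    provisioned_tb = math.ceil(raw_tb / round_to_tb) * round_to_tb
    gb_per_server = provisioned_tb * 1000 / servers
    return raw_tb, provisioned_tb, gb_per_server


raw_tb, provisioned_tb, gb_per_server = index_disk_per_server(
    530_000_000, 17_000, servers=25
)

if __name__ == "__main__":
    print(f"raw: {raw_tb:.2f} tb, provisioned: {provisioned_tb} tb, "
          f"per server: {gb_per_server:.0f} gb")
```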
2.1.2.4 Disk Space for Application Servers and Reporting Servers
In order to provide the reports required by the NARA Catalog reporting requirements, every API access will be recorded in log files.
An estimate of API accesses includes:
60 queries / second “sustained” usage (Rqmt 10.4)
“Normal traffic” of 2,000 concurrent users (Rqmt 10.6)
o Assuming API calls of 1 call per every 10 seconds per user provides 200 API calls / second
Total: 260 API calls / second
Assuming each API call requires about 100 bytes
Disk Usage: 26,000 bytes / second = 2.1gb / day of logs generated
Hold 1 year's worth of logs: 2.1gb * 365 = 766.5gb of logs = 1tb of disk space (rounded up)
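The log volume estimate can be sketched as follows. The names are illustrative, and "gb" is assumed to mean 2^30 bytes here because that reproduces the 2.1gb per day figure; the yearly total computed from unrounded values comes out slightly below the 766.5gb shown above.

```python
SECONDS_PER_DAY = 86_400
GIB = 1024 ** 3


def api_calls_per_second(sustained_qps, concurrent_users, seconds_between_calls):
    """Sustained queries plus one API call per user every N seconds."""
    return sustained_qps + concurrent_users / seconds_between_calls


calls_per_second = api_calls_per_second(60, 2_000, 10)
bytes_per_call = 100  # assumed log size per API call, from the text
gib_per_day = calls_per_second * bytes_per_call * SECONDS_PER_DAY / GIB
gib_per_year = gib_per_day * 365

if __name__ == "__main__":
    print(f"{calls_per_second:.0f} calls/s, {gib_per_day:.2f} gb/day, "
          f"{gib_per_year:.0f} gb/year")
```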
2.1.3 NARA Catalog Storage Hardware
NARA Catalog Storage requirements can be computed in several ways:
Requirement 12.1.1: 10,000tb of storage (10 petabytes)
Compute space for 500 million digital objects (requirement 12.1.2.1):
o Current space = 8.4tb for 1.6 million digital objects
o Scaling up: 8.4 * 500 / 1.6 = 2,625 tb (2.6 petabytes)
Current space for all images + digitization partner images: ~85tb
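The scaling estimate for digital object storage is a simple proportion, sketched below. The function name is hypothetical, and the calculation assumes average object size stays the same as in the current system.

```python
def scaled_storage_tb(current_tb, current_objects, target_objects):
    """Scale current storage linearly to a target object count."""
    return current_tb * target_objects / current_objects


projected_tb = scaled_storage_tb(8.4, 1_600_000, 500_000_000)
share_of_requirement = projected_tb / 10_000  # against the 10,000tb requirement

if __name__ == "__main__":
    print(f"{projected_tb:,.0f} tb projected, "
          f"{share_of_requirement:.0%} of the 10,000tb requirement")
```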
2.1.4 Network Hardware
Network hardware required includes:
Load balancer: Internet → Application Servers
o Load balance requests from the internet across 4 application servers
o The load balancer to the internet will provide boundary control to external systems (Rqmt 13.1.44, 13.1.44.1)
o It will be configured to only allow minimum protocols (Rqmt 13.1.44.2), namely HTTP and HTTPS to the application server.
o The external ICMP ‘echo’ request will be ignored (Rqmt 13.1.44.3), as will external UDP ‘chargen’ requests (Rqmt 13.1.44.4)
Router / Firewall
o For ingesting new content via SFTP (push)
o For registering new updates from DAS (pull)
o The router to NARANet will provide boundary control to NARANet systems (Rqmt 13.1.44, 13.1.44.1)
o It will be configured to only allow minimum protocols (Rqmt 13.1.44.2), namely HTTP / HTTPS to/from DAS, and SFTP to NARA Catalog storage.
o The external ICMP ‘echo’ request will be ignored (Rqmt 13.1.44.3), as will external UDP ‘chargen’ requests (Rqmt 13.1.44.4)
2.1.4.1 Network Layout
The recommended network layout is shown in the following diagram:
[Diagram: recommended network layout within the AWS network. Labels: Public Sub-Net, Private Sub-Net A, Private Sub-Net B; Application Servers, Search Servers, Database Servers, OPA Storage, Content Processing, Reporting and Admin Control; three custom route tables; internet gateways; router; external DAS and FTP Client.]
With the above architecture, routes are carefully controlled to provide as much isolation from internet traffic as possible. In the above diagram the arrow represents “allowed inbound traffic”. Arrows for SSH for system administration are not shown.
The specific routes and network-security required will include:
1. Internet traffic → Application Servers (HTTP / HTTPS).
2. Application Servers → Private Sub-Net A:
   a. Access to NARA Catalog Database Servers (read/write)
   b. Access to Search Servers (read only)
      i. The search servers will be configured to limit application servers to the “/select” URL path.
   c. Access to NARA Catalog Storage
      i. Configured using NFS mount on the Application Servers.
      ii. Application Servers → Digital Objects (read only)
      iii. Application Servers → Bulk-Export Area (read/write)
3. Content Processing / Reporting & Admin Control → Search, Database, NARA Catalog Storage
   a. The servers in private subnets A and B will be able to access each other as needed:
      i. Database → Content Processing (read only)
      ii. Content Processing → Search (write only)
      iii. Content Processing → NARA Catalog Storage (read/write)
      iv. Reporting & Admin Control → All Servers (read/write)
4. Servers in NARANet will need to be connected to select servers in NARA Catalog:
   a. DAS → Content Processing (read only)
   b. SFTP → NARA Catalog Storage (read/write to select directories)
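The route list above amounts to a default-deny allow-list. The following is a minimal sketch of that idea, not an actual firewall or route table configuration. The group names (such as `application` and `storage/bulk-export`) are hypothetical, and only a subset of the routes listed above is modeled.

```python
from typing import NamedTuple


class Route(NamedTuple):
    source: str        # server group initiating the access
    target: str        # server group or storage area being accessed
    modes: frozenset   # subset of {"read", "write"}


READ = frozenset({"read"})
WRITE = frozenset({"write"})
READ_WRITE = frozenset({"read", "write"})

# Partial allow-list; "*" stands for all servers.
ALLOWED_ROUTES = [
    Route("application", "database", READ_WRITE),
    Route("application", "search", READ),
    Route("application", "storage/digital-objects", READ),
    Route("application", "storage/bulk-export", READ_WRITE),
    Route("content-processing", "search", WRITE),
    Route("content-processing", "storage", READ_WRITE),
    Route("reporting-admin", "*", READ_WRITE),
]


def is_allowed(source, target, mode, routes=ALLOWED_ROUTES):
    """Return True only if a listed route permits the access (default deny)."""
    return any(
        r.source == source and r.target in (target, "*") and mode in r.modes
        for r in routes
    )


if __name__ == "__main__":
    print(is_allowed("application", "search", "read"))
    print(is_allowed("application", "search", "write"))
```

Anything not listed is rejected, which mirrors the goal of isolating the private subnets from internet traffic.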
2.2 Sandbox Environment
The following is the system diagram for the sandbox system:
[Diagram: OPA Sandbox system. Components and server counts: Search Engine Array (4), Content Processing (1), Application Servers (1), OPA Storage. External connections via routers: Internet, DAS, FTP Client.]
See section 2.1.2 above for details on the types of server machines required.
Specific requirements on RAM and number of processing cores per server for the sandbox environment are identified below:
System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
Content Processing | primary | 1 | 30gb | 8 | 2tb | med | Content processing & SFTP server.
Search Engine | primary | 4 | 60.5gb | 16 | 1tb | high | Solr Search. Search servers for 100 million index entries.
Application | primary | 1 | 30gb | 16 | 100gb | low | Holds application servers to handle API requests from end-user interfaces.
The details of the sizing and disk space required per machine are the same as in section 2.1.
2.3 Development System
The system diagram for the development system is shown below:
[Diagram: OPA Development system within the AWS network. Components: Search Engine Array, Content Processing, Database, Application Servers (behind load balancers), Reporting, Monitoring, Admin Control, and OPA Storage. External connections via routers: Internet, DAS. Server counts are listed in the table below.]
Specific requirements on RAM and number of processing cores per server for the development environment are identified below:
System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
database | primary | 1 | 122gb | 16 | 250gb | med | MySQL Server
Content Processing | primary | 1 | 30gb | 8 | 2tb | med | Content processing & SFTP server. Note that two servers provide capacity. Failure of one server will reduce ingestion capacity.
Search Engine | primary | 4 | 60.5gb | 16 | 1tb | med | Solr Search. Search servers. Can be configured either as 4x1 (four partitions, no failover) or 2x2 (two partitions with failover) as needs dictate.
Application | primary | 1 | 30gb | 16 | 500gb | low | Holds application servers to handle API requests from end-user interfaces.
Application | failover | 1 | 30gb | 16 | 500gb | low | Additional application server for testing session data persistence across multiple servers.
Reporting, Monitoring, Server Control | primary | 1 | 30gb | 8 | 1tb | low | Holds the reporting application, Zookeeper server management, and system monitoring tools. May hold SFTP server as well.
The details of the sizing and disk space required per machine are the same as in section 2.1.
2.4 UAT System
For each new release of NARA Catalog, a UAT system will be required. It is recommended that this system be substantially the same as the production system shown above in section 2.1.
If a true, elastically scalable cloud environment is available, Search Technologies recommends provisioning the UAT system only “as needed”, around major release dates. This is shown in the following diagram:
[Diagram: UAT release timeline. Launch NEW_UAT environment; start UAT test; at release, NEW_UAT becomes PROD and the previous PROD becomes OLD_PROD; when PROD burn-in is complete, shut down OLD_PROD. Two systems are required only for a 3-month window around release.]
2.4.1 UAT to PROD Procedure
The recommended process for fielding a new UAT system is as follows:
1. Two months before “go live”, launch a new set of virtual machines (NEW_UAT) in the configuration shown in section 2.1.
   a. This system should be the same configuration as shown in 2.1.
2. Deploy a completely new version of NARA Catalog to the NEW_UAT system.
3. Migrate data to NEW_UAT as needed.
   a. Restore backups to NEW_UAT.
   b. Reprocess updates since backup was made.
   c. This should *not* require a new copy of NARA Catalog Storage.
      i. Instead, NEW_UAT will operate on a “test packages” area.
      ii. Packages will be copied from the production area and modified as necessary.
4. Complete UAT test on NEW_UAT.
5. When the new system is ready to go live to production, perform a final system validation:
   a. Complete a final backup restore.
   b. Reprocess updates since backup was made.
   c. Complete a final system validation test.
6. Put NEW_UAT online.
   a. Requests to http://catalog.archives.gov now go to NEW_UAT.
   b. NEW_UAT now becomes PROD.
   c. The previous PROD now becomes OLD_PROD.
7. Monitor and validate PROD to ensure smooth operation.
8. If there is a fatal problem with NEW_UAT:
   a. Restore: OLD_PROD → PROD.
   b. Fix the problem.
   c. Return to step 4 above.
9. Once PROD (formerly NEW_UAT) is safe and running smoothly (past the burn-in period):
   a. Shut down OLD_PROD.
   b. Release the virtual machines back to the cloud.
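The role changes in this procedure (go-live, rollback after a fatal problem, and retirement of the old system) can be sketched as pure functions over a mapping from role name to environment. This is an illustration only: the environment identifiers such as `env-a` are hypothetical, and a real cutover would also involve request routing and data validation that are not modeled here.

```python
def go_live(roles):
    """Promote NEW_UAT to PROD; the previous PROD becomes OLD_PROD."""
    if "PROD" not in roles or "NEW_UAT" not in roles:
        raise ValueError("both PROD and NEW_UAT must exist before go-live")
    return {"PROD": roles["NEW_UAT"], "OLD_PROD": roles["PROD"]}


def roll_back(roles):
    """Fatal problem after go-live: OLD_PROD serves as PROD again."""
    if "PROD" not in roles or "OLD_PROD" not in roles:
        raise ValueError("rollback requires both PROD and OLD_PROD")
    return {"PROD": roles["OLD_PROD"], "NEW_UAT": roles["PROD"]}


def retire_old_prod(roles):
    """Burn-in complete: drop OLD_PROD so its machines can be released."""
    return {role: env for role, env in roles.items() if role != "OLD_PROD"}


if __name__ == "__main__":
    before = {"PROD": "env-a", "NEW_UAT": "env-b"}
    live = go_live(before)
    print(live)
    print(roll_back(live))
    print(retire_old_prod(live))
```

Keeping OLD_PROD intact until burn-in completes is what makes the rollback a simple role swap.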
2.5 Example 2014 and 2015 NARA Catalog Prod Computations
This section covers an example of how 2014 and 2015 server requirements could be computed.
Note: This information is based on data sets known to Search Technologies which will need to be migrated into NARA Catalog Production in 2014 and 2015.
Naturally, Search Technologies is not aware of all of the potential data migrations into NARA Catalog Production which are planned for 2014. Therefore, this section should be viewed merely as an example of how server requirements could be scaled down should 2014 and 2015 volumes be lower than specified in the NARA Catalog requirements spreadsheet.
2.5.1 Example Server Requirements
The current OPA Pilot system has the following characteristics:
Current digital objects: 1.6 million
Current archival descriptions: 8 million
Current authority records: 1.05 million
Expected growth for calendar year 2015:
EOP Packages: 360,000 (15,000 messages x 24 months)
Digital partner objects: 12 million
Based on the above information, Search Technologies believes that 50 servers for search may overestimate the requirements for NARA Catalog Production for Calendar years 2014 and 2015.
Based on the above requirements, the total number of index entries could rise to:
Total index entries (based on above estimates):
o Archival descriptions: 8 million + 25% = 10 million
o Authority records: 1.05 million + 25% = 1.35 million
o Digital objects: (1.6 million + 25%) + 12 million = 14 million
o EOP Packages: 360,000 x 2 (for description and digital object) = 0.72 m
o Total index entries: 26 million
To handle 26 million index entries, the following server counts could be modified:
Reduce search engines: 50 → 4
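The reduction to 4 search servers can be reproduced with the sketch below. It assumes 25% growth on current volumes, 25 million medium-sized entries per node, and two rows (primary plus failover), all taken from the text; unrounded growth figures are used, so individual lines may differ slightly from the rounded values listed above.

```python
import math

MILLION = 1_000_000
GROWTH = 1.25                     # current volume + 25%
ENTRIES_PER_NODE = 25 * MILLION   # medium-sized index entries per node
ROWS = 2                          # primary row plus failover row

entries = {
    "archival descriptions": 8 * MILLION * GROWTH,
    "authority records": 1.05 * MILLION * GROWTH,
    "digital objects": 1.6 * MILLION * GROWTH + 12 * MILLION,
    "EOP packages": 360_000 * 2,  # one description and one digital object each
}

total_entries = sum(entries.values())
partitions = math.ceil(total_entries / ENTRIES_PER_NODE)
search_servers = partitions * ROWS

if __name__ == "__main__":
    print(f"{total_entries / MILLION:.1f} million entries, "
          f"{partitions} partitions, {search_servers} search servers")
```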
Additional servers may be reduced depending on the rate of adoption of NARA Catalog Production:
How much will the APIs be used?
How many simultaneous users will NARA Catalog Production have in 2014 and 2015?
Current OPA Pilot usage is relatively light (0.1 QPS average, 7 QPS peak). Given historical usage, the Application servers may be reduced to 2 (from 4) and the bulk export server may be co-located with the “Reporting, Monitoring, and Server Control” Servers, leading to further reductions in hardware requirements for 2014 and 2015.
2.5.2 Elastic Scalability
Depending on the time it takes to provision new hardware (a function of the cloud environment), reducing hardware for 2014 and 2015 could be a relatively “safe” option, for the following reasons:
Adding search server rows for additional QPS will require about 2 weeks
o Machine instances can be created to launch servers quickly.
o Servers can be added as new “slave replicas” with simple configuration
o 2 weeks would be required for initial index replication, testing, and to account for possible roll-backs and re-attempts should something go wrong.
Adding additional index partitions for additional content will require about 6 weeks
o Machine instances can be created to launch servers quickly.
o Servers can be added as new “partitions” with simple configuration
o 6 weeks would be required to re-balance the documents across the partitions, which may require
Adding new application servers for additional end-user capacity will require about 3 days
o New application servers can be added at any time
o No complex data replications are required (they all share a master database)
o All API and UI transactions are stateless (state is carried in cookies and on the client)
Note that these times (2 weeks for a search row, 6 weeks for additional index partitions, 3 days for additional application servers) could be reduced with additional testing, scripting, and process documentation.
2.5.3 Unknowns
There are a number of unknowns in the calculations above which could cause the systems for 2014 and 2015 to be substantially larger. Specifically:
Will all of AAD be indexed as granules? 105 million records
o Note: Depending on API requirements, these could be “small” records.
o Many more “Small” records can be packed into a single server (as many as 50 million, instead of the 25 million recommended for “medium” records)
Will every name in the 1940 census be indexed as granules? 130 million records
o Note: Depending on API requirements, these could be “small” records. See above.
What other major initiatives will be required?
2.5.4 Computing Server Requirements for Index Entries of Varying Size
For index partition computations, a general understanding of the size of the index entry is required. For the purposes of NARA, documents can be classified as “small”, “medium”, and (possibly) “large”, as follows:
Small
o A single row from a database table
o A half-page of text
Medium
o Anything with <archival-description> XML is automatically medium or larger
o XML metadata for multiple objects
o One or two pages of text
Large
o Over 25 pages of text
Computations involving “small”, “medium”, and large index entries should be based on the following formula:
Index partitions = (number of small entries)/50 million + (number of medium entries)/25 million + (number of large entries)/10 million
Total servers = (index-partitions) * (replicas)
Currently we expect replicas = 2 to handle the QPS rates required by NARA Catalog.
Note: This formula only works if the small entries are truly small. For example, indexing the entire <archival-description> XML with every small entry will automatically turn all index entries to “medium” size.
For example, if all of AAD and if all names in the 1940 census are indexed as “small” entries, then the following computations hold:
Medium entries (from previous sub-sections): 26 million
Small entries (from AAD & 1940 census): 234 million
Index partitions: (26 million/25 million) + (234 million / 50 million) = 5.72
o Round up: 6
Total search engine servers: 6 * 2 = 12
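The sizing formula and the worked example above can be expressed as a small calculation. This is a sketch under the assumptions stated in this section (50 million small, 25 million medium, or 10 million large entries per partition, and 2 replicas); the function names are hypothetical and not part of any delivered tooling.

```python
import math

# Entries that fit in one index partition, per the sizing formula above.
ENTRIES_PER_PARTITION = {
    "small": 50_000_000,
    "medium": 25_000_000,
    "large": 10_000_000,
}


def index_partitions(small=0, medium=0, large=0):
    """Return (fractional partitions, partitions rounded up to a whole server)."""
    raw = (
        small / ENTRIES_PER_PARTITION["small"]
        + medium / ENTRIES_PER_PARTITION["medium"]
        + large / ENTRIES_PER_PARTITION["large"]
    )
    return raw, math.ceil(raw)


def total_search_servers(small=0, medium=0, large=0, replicas=2):
    """Total servers = index partitions (rounded up) * replicas."""
    _, partitions = index_partitions(small, medium, large)
    return partitions * replicas


if __name__ == "__main__":
    raw, partitions = index_partitions(small=234_000_000, medium=26_000_000)
    print(f"partitions: {raw:.2f} -> {partitions}")
    print("servers:", total_search_servers(small=234_000_000, medium=26_000_000))
```

With only the 26 million medium entries and no small entries, the same calculation gives 2 partitions and 4 search engine servers.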
3 Operating System Design
The operating system recommended for NARA Catalog is Red Hat Linux – or similar. Red Hat is on the NARA TRM as a recommended Linux variant.
3.1 Kernel Configuration
The kernel configuration will be as delivered. No kernel customizations are required.
3.2 Memory Configuration
The initial memory configuration will be based on default values for Linux.
Optimal kernel memory parameters (such as shmmax, file-max, swappiness, huge pages, etc.) will be determined based on search engine and MySQL performance tuning, as needed.
Modified parameters as needed to achieve required performance will be documented in the administration guide.
Parameters required for parity checking (Rqmt 13.1.20.8) will be configured as well. [TBD – Requires help from NARA security team to determine correct parameters]
3.3 Accounts
Unix accounts will be managed as follows:
Guest accounts for COTS will be disabled. (Rqmt 13.1.1)
Separate accounts for server processes (Rqmt 13.1.7, 13.1.42)
Separate accounts for operating system account management (Rqmt 13.1.7, 13.1.42)
Login attempts will be limited to a maximum of 5 consecutive invalid attempts by a user during a 15 minute period (Rqmt 13.1.8)
o The NARA Catalog system shall automatically lock the account/node for at least 15 minutes when the maximum number of unsuccessful attempts is exceeded. (Rqmt 13.1.8.1)
The NARA Catalog system shall display an approved system use notification message before granting access. (Rqmts 13.1.9, 13.1.9.1, 13.1.9.2, 13.1.9.3, 13.1.9.4, 13.1.9.5, 13.1.9.6, 13.1.9.6.1)
o [TBD – need “approved system use notification message” from NARA for Linux accounts]
Enforce minimum password rules, including:
o Case sensitive, 8-character mix of upper case letters, lower case letters, numbers and special characters including at least one of each (Rqmt 13.1.38)
o Enforce at least a four character change when new passwords are created (Rqmt 13.1.38.1)
o Require password encryption (means requiring SSH for system access) (Rqmt 13.1.38.2)
o Enforce password minimum and maximum lifetime restrictions (1 day minimum, 90 day maximum) and prohibit password reuse for a minimum of 5 generations (Rqmt 13.1.38.3)
o Shall use FIPS 140-2 standards for cryptographic modules (Rqmt 13.1.40)
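The password composition and change rules above can be illustrated with a short check. This is a sketch for discussion only: on Linux such rules are normally enforced through PAM configuration rather than application code, and the function names here are hypothetical. Two interpretations are assumptions, not statements of the requirements: that "8-character" means a minimum length of 8, and that a "four character change" means at least four differing character positions (with any length difference counted as changes).

```python
import string


def meets_complexity(password):
    """Check the assumed rule: 8+ characters with at least one upper case letter,
    one lower case letter, one number, and one special character."""
    return (
        len(password) >= 8
        and any(c in string.ascii_uppercase for c in password)
        and any(c in string.ascii_lowercase for c in password)
        and any(c in string.digits for c in password)
        and any(c in string.punctuation for c in password)
    )


def changed_characters(old, new):
    """Count differing positions; extra or missing characters count as changes.
    This position-wise definition is an assumption made for illustration."""
    differing = sum(1 for a, b in zip(old, new) if a != b)
    return differing + abs(len(old) - len(new))


def acceptable_new_password(old, new):
    """A new password must meet the complexity rule and differ by 4+ characters."""
    return meets_complexity(new) and changed_characters(old, new) >= 4


if __name__ == "__main__":
    print(acceptable_new_password("Abcdef1!", "Abcdef2!"))  # only one character changed
    print(acceptable_new_password("Abcdef1!", "Xyzwef1!"))  # four characters changed
```

Lifetime restrictions and reuse history (Rqmt 13.1.38.3) depend on stored account state and are not covered by this sketch.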
3.4 Auditing
Auditing will be done with the help of the NARA security team, and based on NARA Linux recommended configurations.
We anticipate that this will include:
Administrator auditing with the “psacct” package and perhaps other packages (such as rootsh logging) (Rqmts 13.1.10, 13.1.11)
Appropriate configuration of syslogd, including auditing of successful and unsuccessful account events (Rqmts 13.1.10, 13.1.12)
Verification of logs generated to /var/log/security and /var/log/audit/audit.log
Protection of logs from unauthorized modification (Rqmts 13.1.19, 13.1.19.1)
Capturing operating system errors (Rqmt 13.1.20.4)
[TBD – Details will require standard recommended configurations from NARA security team]
3.5 Ports Configuration
All ports will be initially “turned off” for all servers. Then ports will be individually turned on as required for inter-process and external communications to ensure that only the minimum number of ports are enabled. (Rqmt 13.1.22)
3.6 Clock Synchronization
The Linux systems will be configured for internal system clocks for auditing, and for synchronizing clocks with NARA’s authoritative time source [TBD – what is NARA’s authoritative time source?]. (Rqmts 13.1.17, 13.1.18)
3.7 SSH
Only SSH (Secure Shell) will be allowed into NARA Catalog servers for system administration.
3.8 Maintaining and Patching the Operating System
The operating system will be maintained and patched using the cloud-recommended procedures.
This may involve:
Halting updates to the system.
o For example, turn off index updates by the ingestion servers.
o This will limit the amount of data which is changing. Most servers will now have “idle” systems with files that do not change.
Taking the server to be patched off-line.
Patching the operating system as necessary.
Bringing the server back on-line.
Re-synchronizing database files as necessary.
o If the ingestion servers are idle, then search engine indexes, application servers, and ingestion servers will not require re-synchronization.
o Therefore, only the RDBMS may still be receiving updates that require synchronization, when a database server is taken off-line, patched, and then brought back on-line.
All critical servers have fail-over siblings which will allow for either one or the other to be brought off-line for patching as necessary.
4 Storage Design
This section covers the design of “NARA Catalog Storage”.
This section does not cover the disk required for each individual server (see section 2.1 for more information about individual server disk).
4.1 Storage Technology for NARA Catalog Prod
The storage technology to be used for NARA Catalog Prod will need to be discussed with the cloud provider. The following are expected storage technologies based on the provider chosen.
4.1.1 Version 1
It is expected that the technology of NARA Catalog storage for NARA Catalog Prod, version 1 will be a simple, mounted disk drive.
The underlying technology will depend on the cloud environment chosen:
For FDC – this will be a NetApp disk
For Amazon Cloud – this will be Elastic Block Store (EBS)
For other cloud systems – this will be standard mounted disk drives.
Note: If using a cloud system, an additional server may be required to service NFS requests from all other NARA Catalog server machines.
The exact storage mechanism will need to be discussed with the cloud provider – once this is determined by NARA.
Depending on the cloud environment and the storage options provided, additional tasks may be required to achieve the high IOPS required by the database and the search engine servers for NARA Catalog. This may include:
Striping the disks for higher I/O Performance
Having many separate volumes instead of a small number of very large volumes
Using different types of disks (i.e. “Provisioned” storage vs “Standard” storage)
The exact configuration and disk mounting steps will be determined once the cloud environment and storage technology are determined.
Server Access to NARA Catalog Storage
NARA Catalog Storage will be NFS mounted to all of the servers that require access to it. This includes:
Ingestion servers (read/write)
Application servers (read/write)
Reporting and server management servers (read only access)
Bulk export server (read/write)
For more fine-grained security controls, see below.
Note that the search engine servers and database servers will not require access to NARA Catalog Storage.
4.1.2 Version 2
Again, depending on the cloud provider, NARA Catalog will use a shared high-volume cloud storage technology, specifically, Amazon S3.
Amazon S3 has a better price per terabyte than mounted disk drives.
However, this will need to be deferred to version 2 for the following reasons:
Cloud storage providers are accessed through custom RESTful interfaces
o This requires additional programming for reading and writing every file to the storage system.
The performance metrics are unknown
o Additional benchmarking and performance testing will be required
Additional management and monitoring tools may be required
4.2 Structure
The structure of NARA Catalog storage is shown in the following diagram:
[Diagram not reproduced. Labels: Environments (Dev / Test, Sandbox, Production, Backup); Digitization Partner / Future Projects; Pre-Ingestion; Mount Points / Top Level Directories; Second level directories (directory per partner / future project; directory per project / responsible entity); OPA-IP Directories; SEIP Directories, possibly compressed and bundled.]
The sub-directories are as follows:
/opa/
  bulk/                     Holds bulk-export files
    <export-files>          Currently holds bulk-export files. May be divided into multiple mount points later if the size / quantity of the bulk exports requires it.
  dev/                      Holds content for the development environment
    <naid-directories>      See section 4.2.2
  future/                   Embargoed & digitization partner content (R-2.3.3.2)
    <project-directories>   Every project has a separate directory / mount
  pre/                      The pre-ingestion area, a holding area for new content
    updates/                Used to hold new digital objects to ingest
      quarantine            Holds quarantined update packages (R-2.10, R-2.11)
    eop/                    Every project has a separate directory / mount
      quarantine            Holds quarantined EOP packages (R-2.10, R-2.11)
    <other-projects>/       Every project has a separate directory / mount
      quarantine            Holds quarantined packages for the specified project (R-2.10, R-2.11)
  prod/                     Holds content for the production environment
    <naid-directories>      See section 4.2.2
  sandbox/                  Holds content for the sandbox environment
    <naid-directories>      See section 4.2.2
4.2.1 Project Directories
Project directories will be created as one per project.
For the “future” directory
o Project directories will be for different digitization projects / partners
o This may contain embargoed data
o Access controls will be based on the individual project needs:
Full access to the project manager and designees
- Access may be revoked once the project is “complete”
Full access to system administrators
No access to any NARA Catalog server process
For the “pre” directory
o “eop” – Holds new SEIPs from the EOP system
“quarantine” – Packages which fail are copied to quarantine.
- The reason for the failure will be in the log files.
Access controls:
- Full access for systems and users which produce EOP SEIP packages
- Full access to system administrators
- Full access to NARA Catalog ingestion servers
- No access to other systems or individuals
o “updates” – Holds partial NARA Catalog-IP directories for updated digital objects.
Directories will be named with the description “naid”, and contain:
- objects.xml – An XML file describing the object ID, object information, and files for each object ID.
- content – A sub-directory holding the actual content files.
“quarantine” – Pre-ingestion packages which fail are copied to quarantine.
- The reason for the failure will be in the log files.
Access controls:
- Full access for systems and users which produce new digital objects
- Full access to system administrators
- Full access to NARA Catalog ingestion servers
- No access to other systems or individuals
o <other projects> - Other directories may be created as necessary to handle additional data flows for new projects.
For example, a new pre-ingestion project directory will be created for each digitization partner project
Access controls:
- Full access to the project owner
- Full access to system administrators
- Full access to NARA Catalog ingestion servers
- No access to other systems or individuals
What’s a Mount Point?
At this juncture, without knowing the cloud environment and without a definitive answer on the storage technology to be used, it is impossible to know what project directories will be separate mount points.
Tentatively:
Every project directory inside “future” will be a separate mount point
o It is expected that each of these directories will represent a significant amount of data.
o Currently, there are 64 TB of data in the “future” directory in OPA Pilot.
The entire “pre” directory could be a single mount point.
o Since this is a transient directory, it is not expected to require much disk space.
o However, some of the <other-projects> may need to be separate mount points (TBD) depending on the amount of data and whether content is provided all at once, or in batches.
4.2.2 NAID Directories / Separate Environments
Dev/Test, Sandbox, Production, and Backup will all have NAID based directory structures. The NAID will be the same NAID for the object as specified in DAS for the associated archival description.
The directory structure will be as follows:
Level0: Environment (“dev”, “prod”, “sandbox”)
Level1: NAID mod (numLevel1Dirs)
o Each environment will have a configured number of “Level1” directories.
For the first production release this will be 10
Maximum anticipated is 100
o Depending on the type of storage, each top-level directory may be a different mount-point.
Level2: (NAID/(numLevel1Dirs)) mod 10000
o Each level2 directory will contain up to 10,000 sub-directories.
Level3: des-NAID
o The entire NAID, prefixed with “des-”, will be used as the directory name for the NARA Catalog Information Package.
For example, for the NAID 5541536, the following levels would apply:
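Level 1: 36 (the lowest two digits of 5541536, assuming numLevel1Dirs = 100)
Level 2: 5415 (the next four digits)
Level 3: des-5541536 (the entire NAID)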
And the final directory would be: /opa/prod/36/5415/des-5541536
The purpose of using the lowest digits for level1 and level2 is to allow for a random distribution of files amongst those levels, so that a single directory path will be less likely to grow at a larger proportion than other directories.
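The directory computation described above can be sketched as follows. This is an illustration under stated assumptions, not the delivered implementation: the function name, its parameters, and the default root of /opa are hypothetical, and directory names are not zero-padded because the example path above shows no padding.

```python
def naid_directory(naid, environment="prod", num_level1_dirs=100, root="/opa"):
    """Build the storage path for a NAID using the Level1/Level2/Level3 scheme.

    Level1 = NAID mod numLevel1Dirs
    Level2 = (NAID / numLevel1Dirs) mod 10000, using integer division
    Level3 = "des-" followed by the entire NAID
    """
    level1 = naid % num_level1_dirs
    level2 = (naid // num_level1_dirs) % 10_000
    return f"{root}/{environment}/{level1}/{level2}/des-{naid}"


if __name__ == "__main__":
    # With 100 Level1 directories, as in the worked example above.
    print(naid_directory(5541536))
    # With 10 Level1 directories, as planned for the first production release.
    print(naid_directory(5541536, num_level1_dirs=10))
```

Note that changing the number of Level1 directories changes the path computed for an existing NAID, so the configured value would need to stay fixed per environment or be accompanied by a migration of existing packages.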
Total size of the storage, assuming:
Level1 directories = 100
Level2 directories = 10,000
Level3 directories = 10,000
…is 10 billion packages (e.g. 10 billion descriptions).
Access Controls
The access controls for each of the different environments will be:
“dev”
o Full access – All NARA Catalog staff (developers, system administrators, testers, etc.)
o Full access – All development server accounts
“prod”
o Full access – Ingestion server accounts
o Full access – Application server accounts
This is required to write transcriptions and translations into NARA Catalog Storage.
This may be changed to read/write, depending on whether or not application servers need to create sub-directories inside of NARA Catalog-IPs [Design TBD]
o Read Access – Reporting, monitoring, and server management accounts
o Read Access – Bulk-export server account
o Full access – System administrators
“sandbox”
o Full access – Sandbox ingestion server accounts
o Read access – Sandbox application server accounts
o Full access – System administrators
4.2.3 SFTP Server Access
SFTP servers will be installed on the content processing servers. SFTP servers will have access to:
/opa/pre – The pre-ingestion area
/opa/future – The future projects / digitization partner / embargoed data area
Notes on SFTP configuration:
Anonymous SFTP access must be disabled.
All users will require an account on NARA Catalog in order to upload content via SFTP.
SFTP will only be available from NARANet.
Access to the /opa/pre and the /opa/future directories will be configured with operating system access control as described in the previous sections.
5 Backups & Recovery
This section covers backup and recovery methods.
5.1 Backups
Note that backups are required for the production environment only.
5.1.1 Backup Schedules
The following backup strategies will be required:
MySQL Databases
o Daily incremental backup
o Weekly full backup
Search engine index files
o Daily incremental backup
o Weekly full backup
Content Processing / Ingestion servers
o Weekly full backup of cache files
Application Servers
o Weekly copy of log files to the reporting servers.
Reporting, Monitoring, and Admin Control
o Weekly incremental backup of log files
5.1.2 Backup Details
Details on the backup mechanism for each type of server will be outlined in the individual design documents:
For search engines: NARA Catalog Search Engine Design
For application servers: NARA Catalog Application Server Design
For the MySQL Database: NARA Catalog Application Server Design
For the Reporting, Monitoring and Admin Servers: NARA Catalog Reporting Design
For content processing: NARA Catalog Ingestion Design
Note that all backup scripts and processes will be fully documented in the Administrator Guide when the system is delivered.
Backups of COTS software and system configuration files will be done after every deployment. This will be documented in the Deployment Guide.
5.1.3 Backup Storage
Backup Storage will be implemented with Amazon AWS “Glacier” storage.
5.1.4 Backup for NARA Catalog Storage
In order to meet up-time and recovery requirements for site-wide disaster scenarios, all of NARA Catalog storage will need to be backed up. Further, the backup should be done off-site.
Due to the size of NARA Catalog storage, the backup method will depend on the cloud environment chosen for NARA Catalog. The method of backup will be determined after consultation with NARA and the cloud provider.
5.2 Recovery from Server Failure
The recovery method will depend on the type of server.
5.2.1 Database Servers
A primary and a failover exist for the MySQL database servers.
A failure of either server will mean operating with a single server until its sibling server can be restored and the database mirrored.
See the NARA Catalog Application Server Design for details on the database recovery process.
5.2.2 Content Processing / Ingestion Servers
Recovery times for content processing servers are longer (2 days) than for other servers (90 minutes). Therefore, the recovery process for content processing will be:
1. Launch a new virtual machine for the server
2. Deploy the appropriate software to the server
a. Steps 1 & 2 could be combined if a “machine image” of the server is saved as part of the deployment procedure.
b. Note that maintenance of machine images is not specified as a requirement.
3. Copy the appropriate backups to the server.
4. Reprocess updates since the last backup was saved.
See the NARA Catalog Ingestion Design for details on the ingestion server recovery process.
5.2.3 Search Engine Servers
A primary and a failover server exist for each search engine server.
Therefore, a failure of either server will mean operating with a single server for the specified index partition until its sibling server can be restored and the index copied.
See the NARA Catalog Search Engine Design for details on the search engine recovery process.
5.2.4 Application Servers
Application servers can be recovered at any time by simply launching a new instance of the server and adding it to the server farm. No recovery of backups is required.
See the NARA Catalog Application Server Design for details on the application server recovery process.
5.2.5 Reporting, Monitoring & Admin Control
A primary and a failover exist for the reporting, monitoring, and admin control servers.
A failure of either server will mean operating with a single server until its sibling server can be restored and the backups recovered.
See the NARA Catalog Reporting Design for details on the recovery process for reporting functions.
See the NARA Catalog Search Engine Design for details on the recovery process for Zookeeper.
5.3 Recovery from Site Failure
Recovery from site failure will require the following steps:
1. Launch new copies of all server instances.
2. Restore all databases from the latest backups.
3. Reprocess updates since the latest backups were taken.
4. Re-index records as required.
It is conceivable that a complete re-index of all NARA Catalog content will be required to recover from a site-failure.
If this is the case, then multiple content ingestion servers may need to be launched to reprocess records in parallel, to perform a complete re-index within 7 days.
6 System Monitoring
System monitoring will be performed using Amazon CloudWatch monitoring services.