National Archives and Records Administration National Archives Catalog (The Catalog) NARA Catalog System Design – Catalog Perspective – Status-Final Version 1.6 July 8, 2015


National Archives and Records Administration

National Archives Catalog (The Catalog)

NARA Catalog System Design – Catalog Perspective –

Status-Final
Version 1.6
July 8, 2015


NARA Catalog System Design

Contents

1 Overview
1.1 High-Level Architecture
1.2 NARA Catalog in Context
1.2.1 NARA Catalog Production
1.2.2 NARA Catalog Sandbox
1.3 Applicable Requirements
1.3.1 Sandbox Environment and Segregated Storage
1.3.2 Performance Requirements
1.3.3 Availability
1.3.4 Volume
1.3.5 Security Requirements
2 Hardware and Network Design
2.1 Production System
2.1.1 Assumptions
2.1.2 Server Hardware
2.1.3 NARA Catalog Storage Hardware
2.1.4 Network Hardware
2.2 Sandbox Environment
2.3 Development System
2.4 UAT System
2.4.1 UAT to PROD Procedure
2.5 Example 2014 and 2015 NARA Catalog Prod Computations
2.5.1 Example Server Requirements
2.5.2 Elastic Scalability
2.5.3 Unknowns
2.5.4 Computing Server Requirements for Index Entries of Varying Size
3 Operating System Design
3.1 Kernel Configuration
3.2 Memory Configuration
3.3 Accounts
3.4 Auditing
3.5 Ports Configuration
3.6 Clock Synchronization
3.7 SSH
3.8 Maintaining and Patching the Operating System
4 Storage Design
4.1 Storage Technology for NARA Catalog Prod
4.1.1 Version 1
4.1.2 Version 2
4.2 Structure
4.2.1 Project Directories
4.2.2 NAID Directories / Separate Environments
4.2.3 SFTP Server Access
5 Backups & Recovery
5.1 Backups
5.1.1 Backup Schedules
5.1.2 Backup Details
5.1.3 Backup Storage
5.1.4 Backup for NARA Catalog Storage
5.2 Recovery from Server Failure
5.2.1 Database Servers
5.2.2 Content Processing / Ingestion Servers
5.2.3 Search Engine Servers
5.2.4 Application Servers
5.2.5 Reporting, Monitoring & Admin Control
5.3 Recovery from Site Failure
6 System Monitoring


Version Control

Version | Date | Reviewer | Summary Description
1.0 | 2014-03-02 | Paul Nelson | Complete first version for NARA review
1.1 | 2014-03-16 | Paul Nelson | Incorporate changes from DCRF
1.2 | 2014-04-11 | Madhu Koneni | Adjusted the servers configuration based on what AWS provides
1.3 | 2014-05-21 | Paul Nelson | Updates from NARA SE Architecture review
1.4 | 2014-11-14 | Kristy Martin | Removed “Confidential to Search Technologies” text from the footer
1.5 | 2014-11-24 | Brandon Stahl | Replaced https://research.archives.gov url with https://catalog.archives.gov url
1.6 | 2015-07-08 | Brandon Stahl | Rebranded OPA as NARA Catalog


1 Overview

This document presents the system design, including hardware specifications, for the National Archives Catalog system currently being developed for the National Archives and Records Administration (NARA).

Specifically, this document will cover:

• Server requirements for NARA Catalog Production, including:
  o Server machines
  o Server specifications

• Disk space requirements for NARA Catalog Production, including:
  o Type of disk space
  o Size and I/O access requirements

• Networking requirements for NARA Catalog Production, including:
  o Network connectivity to the internet
  o Network connectivity to NARANet
  o Load-balancing / routing

• Requirements for other NARA Catalog Systems, including:
  o The sandbox environment
  o The developer environment
  o UAT environment

• Other system tools, including:
  o SFTP service for ingestion of digital objects
  o System monitoring


1.1 High-Level Architecture

The following diagram provides an overview of all NARA Catalog systems:

[Diagram: high-level architecture. Labels shown: Content Processing (multiple pipelines for multiple content processing data flows); Search Array; Annotations and Registration Database; DAS; OPA Storage; OPA Application Server; Search API; Annotations API; Authentication API; Authorized Users API; Access API; OPA User Interface (Twitter Bootstrap, Angular.js); Authorized Users Interface; Web Browser; Internet; SFTP Content Submissions; Content Submissions; Bulk-Exports server.]

The purpose of each system is as follows:

• Content Processing – The back-end system responsible for ingestion, maintaining NARA Catalog storage, and keeping the search engine indexes up to date.

• Search Array – The search engine itself, structured as a series of independent search nodes, each one responsible for searching a portion of the entire index. Index portions are called “shards” and should hold around 25-50 million records. Each search node has a redundant copy to increase query capacity and to provide failover.

• NARA Catalog Storage – The long-term content storage for all publicly available NARA data. NARA Catalog storage will contain a copy of NARA data so that it can be delivered quickly and efficiently to the public.

• Annotations and Registration Database – Contains registered user account information, tags, comments, transcriptions and translations, as well as bookkeeping information for all annotations, such as lists of recently created or modified annotations, annotations per user, etc.

• Application Server – The system (likely made up of multiple servers) which handles all end-user and authorized user requests.

• Client – Client software will be written in JavaScript and HTML5 and will run inside the user’s web browser.
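The Search Array sizing rule above (shards of roughly 25-50 million records, each search node with one redundant copy) can be illustrated with a minimal Python sketch. The function name, the 100-million-record index size, and the choice of two copies per shard (one primary plus one redundant copy) are assumptions made for illustration; they are not figures taken from this design.

```python
def plan_search_array(total_records, records_per_shard, copies_per_shard=2):
    """Return the shard and node counts needed to hold total_records.

    copies_per_shard=2 models one search node plus its redundant copy
    (an assumption based on the Search Array description).
    """
    if total_records < 0 or records_per_shard <= 0 or copies_per_shard <= 0:
        raise ValueError("counts must be positive")
    # Integer ceiling division; at least one shard is always needed.
    shards = max(1, -(-total_records // records_per_shard))
    return {"shards": shards, "nodes": shards * copies_per_shard}


# Hypothetical index of 100 million records, sized at both ends of the
# 25-50 million records-per-shard range.
for per_shard in (25_000_000, 50_000_000):
    print(per_shard, plan_search_array(100_000_000, per_shard))
```

Sizing at the low end of the range yields more, smaller shards; sizing at the high end yields fewer, larger ones, which is the trade-off the hardware sections of a design like this have to settle.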


1.2 NARA Catalog in Context

The following diagrams show how NARA Catalog fits into context with the other processes and systems in the National Archives.

There are two context diagrams, one for NARA Catalog Production and a second for the NARA Catalog Sandbox system. (Note: These diagrams are preliminary.)

1.2.1 NARA Catalog Production

The following diagram shows how NARA Catalog Production fits into the rest of the NARA environment:

[Diagram: NARA Catalog Production in context. Labels shown: OPA Production; OPA Storage; Annotations and Registration Database; Search Engine; Ingestion and Content Processing; Application Server; Trusted Repository (future); Digital Processing Environment (DPE); Description & Authority Service (DAS); Content Owners; Content Updates; NARA Moderator; NARA Authorized User; Non-professional Users; Researchers; NARA Research Support Services; Third Party API User.]

1.2.1.1 Content Providers

In the above diagram, systems which provide content (or are anticipated to someday provide content) are shown on the left, and consumers of NARA Catalog Production services are shown on the right.

NARA Catalog Production will receive updates from the NARA Description & Authority Service (DAS), as well as from the Digital Processing Environment (DPE) through the Trusted Repository (TBD). Only trusted content with full digital provenance should be ingested into NARA Catalog.

Content updates will be provided by content owners as new files are scanned and/or content modifications are required. This can include storage of content (for example, from digitization partners) for future processing.


1.2.1.2 Consumers of NARA Catalog Services

Users of NARA Catalog include the following categories. The categories are not exclusive; a single person can be in several of these roles:

• Non-Professional Users – These are members of the public who are not professional or academic researchers. These users fall into various categories:
  o Occasional searchers – Log into NARA Catalog from time to time to search, for example to find family members or fellow soldiers.
  o Contributors – These are users who help contribute to the archive with comments, tags, transcriptions, or translations.

• Researchers – Researchers are looking for specific source materials for specific research goals, for example to research a biography of a famous politician.

• Third Party API users – These are third party organizations that wish to interact programmatically with the NARA Catalog system, for example to bulk export images (The Digital Public Library of America, or Wikipedia) or to create new custom interfaces for searching NARA Catalog content.

• NARA Research Support Services – These are NARA employees who help researchers. It is expected they will be users of NARA Catalog to help the public find information.

• NARA Contribution Moderators – These are NARA employees who review contributions from the public. Content which is spam or vandalism will be removed (with a comment).

• NARA Authorized Users – These are NARA employees responsible for managing the user account database. They can deactivate and re-activate registered users and respond to support requests (e.g., a request to change a password).


1.2.2 NARA Catalog Sandbox

The purpose of NARA Catalog Sandbox is to “trial run” new content before it is posted on-line to the general public. In this capacity, it will have a different set of content providers and consumers, as shown below:

[Diagram: NARA Catalog Sandbox in context. Labels shown: OPA Sandbox; OPA Sandbox Storage; Search Engine; Ingestion and Content Processing; Sandbox Application Server; Digital Processing Environment (DPE); Description & Authority Service (DAS); Content Owners; Content Updates; NARA Archivists; OPA Review Board.]

It is expected that DPE will provide content directly to the NARA Catalog sandbox, so the content can be tested in NARA Catalog before it is written to the trusted repository. Similarly, NARA Catalog sandbox will need to pull description data from DAS, as it would normally need to do for any sort of content ingestion.

NARA Catalog sandbox is not available for public consumption. Instead, the NARA Catalog sandbox application will be used only by:

• Content Owners – Who need to view and test their content in NARA Catalog Sandbox so they can verify its accuracy before it is moved on-line for the public.

• NARA Archivists – Who will need to view the content as well, to ensure that it meets archival standards.

• The NARA Catalog Review Board – An interdisciplinary group that verifies the quality of the content (and the data description files) as necessary before content can be officially released to the public.


1.3 Applicable Requirements

The requirements which drive the system design are identified in the following tables, along with the section of this document to which each requirement is allocated.

1.3.1 Sandbox Environment and Segregated Storage

Requirement | Requirement Text | Section
2.3.1 | The NARA Catalog system shall provide a sandbox for a data producer to deposit records. | 2.2
2.3.1.1 | The sandbox shall allow indexing of the deposited records. | 2.2
2.3.1.2 | The sandbox shall allow searching of the deposited records by authorized users. | 2.2
2.3.2 | The sandbox shall index records that are not yet released for search by public users. | 2.2
2.3.3 | The NARA Catalog system shall exclude records from the search that are not yet released for public access. | 2.2
2.3.3.1 | The NARA Catalog system shall provide the capability for a System Administrator to set an embargo date on data that will not be available to the public. | 4.2
2.3.3.2 | The NARA Catalog system shall have a segregated storage space for digital objects that are not yet publicly available for search. | 4.2
2.10 | The NARA Catalog system shall provide a staging area to store SEIP packages that do not contain a description in DAS. | 4.2
2.11 | The NARA Catalog system shall provide a staging area to store digital objects that do not contain a description in DAS. | 4.2
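Requirements 2.3.3 and 2.3.3.1 together imply a release filter in front of public search. The following is a minimal Python sketch of that idea, not the system's actual implementation. The `Record` type, its field names, and the rule that a record becomes public on its embargo date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Record:
    record_id: str                        # hypothetical identifier field
    embargo_until: Optional[date] = None  # None means no embargo was set


def publicly_searchable(records: List[Record], today: date) -> List[Record]:
    """Keep only records whose embargo is absent or has already passed.

    Assumption: a record becomes public on its embargo date itself.
    """
    return [
        r for r in records
        if r.embargo_until is None or r.embargo_until <= today
    ]


# Hypothetical records checked against a fixed date.
records = [
    Record("a"),
    Record("b", embargo_until=date(2030, 1, 1)),
    Record("c", embargo_until=date(2015, 1, 1)),
]
visible = publicly_searchable(records, today=date(2015, 7, 8))
print([r.record_id for r in visible])
```

Applying the filter at query time is only one option; the same rule could be enforced at indexing time by keeping embargoed records out of the public index entirely.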

1.3.2 Performance Requirements

Requirement | Requirement Text | Section
10.1 | The NARA Catalog system shall have response times for returning a search result. | 2.1.2
10.1.1 | The NARA Catalog system response time for returning a search result shall be less than 1 second for 90% of queries, not including network transfer to/from the browser. | 2.1.2
10.1.2 | The NARA Catalog system response time for returning a search result shall be less than 2 seconds for 98% of queries, not including network transfer to/from the browser. | 2.1.2
10.1.3 | The NARA Catalog system response time for returning a search result shall be less than 3 seconds for 99% of queries, not including network transfer to/from the browser. | 2.1.2
10.1.4 | The NARA Catalog system response time for returning a search result shall be less than 5 seconds for 99.99% of queries, not including network transfer to/from the browser. | 2.1.2
10.2 | The NARA Catalog system shall have response times for navigating between screens. | 2.1.2
10.2.1 | The NARA Catalog system response times for navigating between screens when no search is involved shall be 99% within 1 second, not including network transfer to/from the browser. | 2.1.2
10.2.2 | The NARA Catalog system response times for navigating between screens when no search is involved shall be 99.99% within 2 seconds, not including network transfer to/from the browser. | 2.1.2
10.2.3 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 1 second for 90% of queries, not including network transfer to/from the browser. | 2.1.2
10.2.4 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 2 seconds for 98% of queries, not including network transfer to/from the browser. | 2.1.2
10.2.5 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 5 seconds for 99.99% of queries, not including network transfer to/from the browser. | 2.1.2
10.2.6 | The NARA Catalog system response times for navigating between screens that are not search results shall be a maximum of one (1) second. | 2.1.2
10.3 | The NARA Catalog system shall be capable of supporting at a minimum one (1) million user accounts. | 2.1.2.1
10.4 | The NARA Catalog system shall be able to provide sustained query performance of no less than sixty (60) queries per second for queries executed in sequence. | 2.1.2
10.4.1 | The NARA Catalog system shall provide a procedure for increasing query capacity (queries per second) as needed to handle expected capacity increases, with a maximum required lead time of two (2) weeks. | 2.5.2
10.4.2 | The NARA Catalog system shall provide a procedure for decreasing query capacity (queries per second) when increased query capacity is no longer required, but not less than the base query capacity provided at production launch. | 2.5.2
10.5 | <allocated to NARA Catalog Search Engine Design> |
10.6 | The NARA Catalog system shall support normal traffic, at a minimum, two-thousand (2,000) concurrent users. | 2.1.2
10.7 | The NARA Catalog system shall support surge traffic, at a minimum, twenty-thousand (20,000) concurrent users. | 2.1.2
10.8 | The NARA Catalog system shall be able to provide peak query performance of one-hundred (100) queries per second, for queries executed in sequence. | 2.1.2
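The tiered search response-time targets in requirements 10.1.1 through 10.1.4 can be checked mechanically against measured latencies. The sketch below is one way to do that in Python. The function name and the sample data are hypothetical, and the latencies are assumed to be server-side times that already exclude network transfer to and from the browser.

```python
# Search response-time targets from requirements 10.1.1 to 10.1.4:
# (minimum share of queries, time limit in seconds).
SEARCH_TARGETS = [(0.90, 1.0), (0.98, 2.0), (0.99, 3.0), (0.9999, 5.0)]


def check_search_latencies(latencies):
    """Return one (share, limit, observed_share, passed) tuple per target."""
    if not latencies:
        raise ValueError("at least one latency sample is required")
    results = []
    for share, limit in SEARCH_TARGETS:
        # The requirements say "less than", so the comparison is strict.
        observed = sum(1 for t in latencies if t < limit) / len(latencies)
        results.append((share, limit, observed, observed >= share))
    return results


# Hypothetical sample of 10,000 server-side latencies, in seconds.
sample = [0.5] * 9500 + [1.5] * 400 + [2.5] * 90 + [4.0] * 10
for share, limit, observed, passed in check_search_latencies(sample):
    status = "pass" if passed else "fail"
    print(f"{share:.2%} under {limit:g}s: observed {observed:.2%}, {status}")
```

Note that the 99.99% tier only becomes meaningful with at least 10,000 samples, since a single slow query in a smaller sample already exceeds the allowed 0.01%.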


1.3.3 Availability

Requirement | Requirement Text | Section
11.1 | The NARA Catalog system shall be 99.5% available for search and other processing 24 hours a day/7 days a week. | 5.2
11.2 | The NARA Catalog system shall be able to recover system functionality after a failure. | 5.2
11.2.1 | The NARA Catalog system shall be able to recover search functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.2 | The NARA Catalog system shall be able to recover search functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.3 | The NARA Catalog system shall be able to recover tags from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.4 | The NARA Catalog system shall be able to recover tags from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.5 | The NARA Catalog system shall be able to recover comments from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.6 | The NARA Catalog system shall be able to recover comments from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.7 | The NARA Catalog system shall be able to recover translations from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.8 | The NARA Catalog system shall be able to recover translations from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.9 | The NARA Catalog system shall be able to recover transcriptions from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.10 | The NARA Catalog system shall be able to recover transcriptions from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.11 | The NARA Catalog system shall be able to recover login functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.12 | The NARA Catalog system shall be able to recover login functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.13 | The NARA Catalog system shall be able to recover API functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.14 | The NARA Catalog system shall be able to recover API functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2
11.2.15 | The NARA Catalog system shall be able to recover the ingest functionality from any hardware failure within 2 days. | 5.2
11.2.16 | The NARA Catalog system shall be able to recover the ingest functionality from any software failure within 2 days. | 5.2
11.2.17 | The NARA Catalog system shall be able to recover the reporting functionality from any hardware failure within 2 days. | 5.2
11.2.18 | The NARA Catalog system shall be able to recover the reporting functionality from any software failure within 2 days. | 5.2
11.2.19 | The NARA Catalog system shall be able to recover authorized user interfaces from any hardware failure within 2 days. | 5.2
11.2.20 | The NARA Catalog system shall be able to recover authorized user interface functionality from any software failure within 2 days. | 5.2
11.2.21 | The NARA Catalog system shall be able to recover the information exchange functionality from any hardware failure within 2 days. | 5.2
11.2.22 | The NARA Catalog system shall be able to recover the information exchange functionality from any software failure within 2 days. | 5.2
11.2.23 | The NARA Catalog system shall be able to recover all functionality from a site-wide system failure within 7 days, with no more than 48 hours of data loss. | 5.3
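To put the 99.5% availability target of requirement 11.1 next to the 90-minute recovery targets, it helps to convert the percentage into a downtime budget. The following Python sketch does that arithmetic. The function name and the choice of a 365-day year are assumptions, and the requirement itself does not state the period over which availability is measured.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: availability measured over a 365-day year


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime, in hours, permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * period_hours


budget = allowed_downtime_hours(99.5)
print(f"Yearly downtime budget at 99.5%: {budget:.1f} hours")
# How many full 90-minute recoveries (requirements 11.2.1 to 11.2.14)
# would fit inside that yearly budget.
print(f"90-minute recoveries within budget: {int(budget // 1.5)}")
```

Under these assumptions the budget is roughly 43.8 hours per year, so the 2-day and 7-day recovery targets for ingest, reporting, and site failure can only coexist with requirement 11.1 if those functions are measured separately from search availability.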

1.3.4 Volume

Requirement | Requirement Text | Section
12.1 | The NARA Catalog system configuration shall provide the capability to scale on demand. | 2.5.2
12.1.1 | The NARA Catalog system architecture shall be capable of supporting a minimum of 10,000 terabytes of NARA Catalog source data, and scalable up to 57,000 terabytes of NARA Catalog source data. | 2.1.3
12.1.2 | <Allocated to Search Engine Design> | 4
12.1.2.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 500 million digital objects. | 2.1.3, 2.1.2.3
12.1.3 | <Allocated to Search Engine Design> |
12.1.3.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 20 million archival description records. | 2.1.2.3
12.1.4 | The NARA Catalog system architecture shall be capable of supporting a minimum of 20 million authority records. | 2.1.2.3
12.1.4.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 10 million authority records. | 2.1.2.3
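As a rough sanity check on the volume figures, the storage minimums in requirement 12.1.1 can be divided by the object minimum in requirement 12.1.2.1 to see what average object size they imply. This Python sketch is purely illustrative: the two minimums are independent requirements, the function name is hypothetical, and decimal terabytes (10^12 bytes) are assumed because the source does not say which unit is meant.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes
BYTES_PER_MB = 10**6   # assumption: decimal megabytes


def average_object_megabytes(storage_tb, object_count):
    """Return the average object size, in MB, implied by a storage volume."""
    if object_count <= 0:
        raise ValueError("object_count must be positive")
    return storage_tb * BYTES_PER_TB / object_count / BYTES_PER_MB


# Minimum and scaled-up source data volumes (12.1.1) against the
# 500 million digital object minimum (12.1.2.1).
for storage_tb in (10_000, 57_000):
    size_mb = average_object_megabytes(storage_tb, 500_000_000)
    print(f"{storage_tb} TB over 500 million objects: {size_mb:.0f} MB each")
```

The result is only an average under the stated assumptions; actual digital objects will vary widely in size.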


1.3.5 Security Requirements

The following security requirements are allocated to the system design.

Requirement Requirement Text Section

13.1 The NARA Catalog system shall be implemented in compliance with NARA security guidance as provided by NARA in the NARA Catalog and Cloud Service Provider Baseline Security Controls.

3

13.1.1 The NARA Catalog system shall be delivered with any guest accounts disabled for COTS products installed on the system. (1.1 Access Control, AC-2)

3.3

13.1.2 The NARA Catalog system shall automatically terminate temporary and emergency accounts after a period not to exceed 15 days for unclassified information systems (1.1 Access Control, AC-2 (2))

3.3

13.1.3 The NARA Catalog system shall automatically disable inactive accounts after [a period not to exceed 365 days]. (1.1 Access Control, AC-2 (3))

3.3

13.1.4 The NARA Catalog system shall automatically audit account creation, modification, disabling, and termination actions and notifies, as required, appropriate individuals. (1.1 Access Control, AC-2 (4))

3.4

13.1.5 The NARA Catalog system shall isolate the programs and data areas of users from other users and the system itself. (1.1 Access Control, AC-3)

3, 4.2

13.1.5.1 The NARA Catalog system shall provide for the capability to enforce role-based access control policies. (1.1 Access Control, AC-3)

3.3

13.1.6 The NARA Catalog system shall enforce approved authorizations for controlling the flow of information within the system and between interconnected systems in accordance with applicable policy. (1.1 Access Control, AC-4)

3.5, 2.1.4

13.1.7 The NARA Catalog system shall provide the capability to enforce the concept of least privilege, allowing only authorized accesses for users (and processes acting on behalf of users) which are necessary to accomplish assigned tasks in accordance with NARA missions and business functions. (1.1 Access Control, AC-6)

4.2, 3.3

13.1.7.1 The NARA Catalog system shall be able to enforce restrictions for access to security-related functions. (1.1 Access Control, AC-6 (1)) Examples of security functions include but are not limited to: establishing system accounts, configuring access authorizations (i.e., permissions, privileges), setting events to be audited, system programming, system and security administration, and other privileged functions.

3.3

13.1.8 The NARA Catalog system shall enforce a limit of [a maximum of 5] consecutive invalid login attempts by a user during a [15 minute period]. (1.1 Access Control, AC-7a)

3.3

13.1.8.1 The NARA Catalog system shall automatically [locks the account/node for at least 15 minutes ] when the maximum number of unsuccessful attempts is exceeded. (1.1 Acces Control, AC-7b)

3.3

15

Page 16: Search Technologies Assessment · Web viewSolr Search Search servers. Can be configured either as 4x1 (for partitions, no failover) or 2x2 (two partitions with failover) as needs

NARA Catalog System Design

13.1.9 The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that provides privacy and security notices consistent with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance. (1.1 Access Control, AC-8a)

3.3

13.1.9.1 The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states users are accessing a U.S. Government information system. (1.1 Access Control, AC-8a)

3.3

13.1.9.2 The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states system usage may be monitored, recorded, and subject to audit. (1.1 Access Control, AC-8a)

3.3

13.1.9.3 The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states unauthorized use of the system is prohibited and subject to criminal and civil penalties. (1.1 Access Control, AC-8a)

3.3

13.1.9.4 The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states use of the system indicates consent to monitoring and recording. (1.1 Access Control, AC-8a)

3.3

13.1.9.5 The NARA Catalog system shall retain the notification message or banner on the screen until users take explicit actions to log on to or further access the information system. (1.1 Access Control, AC-8b)

3.3

13.1.9.6 The NARA Catalog system shall display the system use information when appropriate, before granting further access. (1.1 Access Control, AC-8c)

3.3

13.1.9.6.1 The NARA Catalog system shall display references, if any, to monitoring, recording, or auditing that are consistent with privacy accommodations for such systems that generally prohibit those activities, and shall include in the notice given to public users of the information system a description of the authorized uses of the system. (1.1 Access Control, AC-8c)

3.3

13.1.10 The NARA Catalog system shall be capable of auditing successful and unsuccessful account logon events, account management events, object access, policy change, privilege functions, process tracking, and system events. (1.3 Audit and Accountability, AU-2)

3.4

13.1.11 The NARA Catalog system shall be capable of auditing all administrator activity, authentication checks, authorization checks, data deletions, data access, data changes, and permission changes. (1.3 Audit and Accountability, AU-2)

3.4

13.1.12 The NARA Catalog system shall produce audit records that contain sufficient information to, at a minimum, establish what type of event occurred, when (date and time) the event occurred, where the event occurred, the source of the event, the outcome (success or failure) of the event, and the identity of any user/subject associated with the event. (1.3 Audit and Accountability, AU-3)

3.4


13.1.12.1 For data requiring moderate or high integrity, the NARA Catalog system shall include [the date and time of the event; the component of the information system (e.g., software component, hardware component) where the event occurred; type of event; subject identity; and the outcome (success or failure) of the event] in the audit records for audit events identified by type, location, or subject. (1.3 Audit and Accountability, AU-3 (1))

3.4

13.1.13 The NARA Catalog system shall allocate audit record storage capacity and configure auditing to reduce the likelihood of such capacity being exceeded. (1.3 Audit and Accountability, AU-4)

3.4

13.1.14 The NARA Catalog system shall alert designated NARA officials in the event of an audit processing failure. (1.3 Audit and Accountability, AU-5a)

3.4, 6

13.1.14.1 The NARA Catalog system shall overwrite the oldest audit records after an audit processing failure, for low or moderate integrity information systems. (1.3 Audit and Accountability, AU-5b)

3.4

13.1.15 The NARA Catalog system shall provide an audit reduction and report generation capability. (1.3 Audit and Accountability, AU-7)

3.4

13.1.16 The NARA Catalog system shall provide the capability to automatically process audit records for events of interest based on selectable event criteria. (1.3 Audit and Accountability, AU-7(1))

3.4

13.1.17 The NARA Catalog system shall use internal system clocks to generate time stamps for audit records. (1.3 Audit and Accountability, AU-8)

3.6

13.1.18 The NARA Catalog system shall synchronize internal information system clocks [at least every 24 hours] with [NARA’s authoritative time source]. (1.3 Audit and Accountability, AU-8 (1))

3.6

13.1.19 The NARA Catalog system shall protect audit information and audit tools from unauthorized access, modification, and deletion. (1.3 Audit and Accountability, AU-9)

3.4

13.1.19.1 The NARA Catalog system shall provide the capability to log actual and attempted machine access to the audit log. (1.3 Audit and Accountability, AU-9)

3.4

13.1.20 The NARA Catalog system shall provide audit record generation capability for the list of auditable events defined in AU-2. (1.3 Audit and Accountability, AU-12a)

3.4

13.1.20.1 The NARA Catalog system shall allow designated NARA personnel to select which auditable events are to be audited by specific components of the system. (1.3 Audit and Accountability, AU-12b)

3.4

13.1.20.2 The NARA Catalog system shall generate audit records for the list of audited events defined in AU-2 with the content as defined in AU-3. (1.3 Audit and Accountability, AU-12c)

3.4

13.1.20.3 The NARA Catalog system shall capture error logs from COTS products. (1.3 Audit and Accountability, AU-12)

2.1.2.4

13.1.20.4 The NARA Catalog system shall capture Operating System errors. (1.3 Audit and Accountability, AU-12)

3.4

13.1.20.5 <Allocated to NARA Catalog Application Server Design>


13.1.20.6 The NARA Catalog system shall co-locate COTS error logs from different locations to a common storage location. (1.3 Audit and Accountability, AU-12)

2.1.2.4

13.1.20.7 <Allocated to NARA Catalog Ingestion Design, NARA Catalog Application Server Design, NARA Catalog Search Engine Design>

13.1.20.8 The NARA Catalog system shall provide error detection when accessing memory via parity and/or hardware register checking, as available by the cloud environment selected by the government for hosting the NARA Catalog servers. (1.3 Audit and Accountability, AU-12)

3.2

13.1.21 The NARA Catalog system shall implement configuration settings for information technology products employed within the information system using [Security Architecture security configuration checklists approved and published by NARA IT Security Staff (NHI)] that reflect the most restrictive mode consistent with operational requirements. (1.5 Configuration Management, CM-6)

3

13.1.21.1 <Allocated to NARA Catalog Application Server Design for MySQL and JBoss Configuration>

13.1.22 The NARA Catalog system shall use the Center for Internet Security guidelines (Level 1) to disable ports, protocols, and/or services identified in the configuration guides. (1.5 Configuration Management, CM-7)

3.5

13.1.23 The NARA Catalog system shall provide the capability for backup of the system. (1.6 Contingency Planning, CP-9)

5.1

13.1.24 The NARA Catalog system shall provide the capability for the backup of COTS product files as required to restore operational capability. (1.6 Contingency Planning, CP-9)

5.1

13.1.25 The NARA Catalog system shall provide the capability for the backup of application files as required to restore operational capability. (1.6 Contingency Planning, CP-9)

5.1

13.1.26 The NARA Catalog system shall provide the capability for the backup of configuration support files as required to restore operational capability. (1.6 Contingency Planning, CP-9)

5.1

13.1.27 The NARA Catalog system shall provide the capability for the backup of the files listed in the NARA Catalog Administration Guide, Section 5. (1.6 Contingency Planning, CP-9)

5.1

13.1.28 The NARA Catalog system shall provide the capability to cancel a scheduled backup process, subject to permissions. (1.6 Contingency Planning, CP-9)

5.1

13.1.29 The NARA Catalog system shall provide the capability to cancel a manual backup process, subject to permissions. (1.6 Contingency Planning, CP-9)

5.1

13.1.30 The NARA Catalog system shall provide the capability to recover the system. (1.6 Contingency Planning, CP-10)

5.2, 5.3

13.1.31 The NARA Catalog system shall provide the capability to recover COTS product files to restore operational capability. (1.6 Contingency Planning, CP-10)

5.2, 5.3

13.1.32 The NARA Catalog system shall provide the capability to recover application files to restore operational capability. (1.6 Contingency Planning, CP-10)

5.2, 5.3


13.1.33 The NARA Catalog system shall provide the capability to recover configuration support files to restore operational capability. (1.6 Contingency Planning, CP-10)

5.2, 5.3

13.1.34 The NARA Catalog system shall provide the capability to recover from a hardware failure. (1.6 Contingency Planning, CP-10)

5.2, 5.3

13.1.35 The NARA Catalog system shall provide the capability to recover from a physical site outage. (1.6 Contingency Planning, CP-10)

5.2, 5.3

13.1.36 The NARA Catalog system shall uniquely identify and authenticate users (or processes acting on behalf of users). (1.7 Identification & Authentication, IA-2)

3.3

13.1.37 The NARA Catalog system shall protect authenticator content from unauthorized disclosure and modification. (1.7 Identification & Authentication, IA-5)

3.3

13.1.38 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce minimum password complexity of [a case sensitive, 8-character mix of upper case letters, lower case letters, numbers, and special characters, including at least one of each]. (1.7 Identification & Authentication, IA-5(1))

3.3

13.1.38.1 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce at least a [four character change] when new passwords are created. (1.7 Identification & Authentication, IA-5(1))

3.3

13.1.38.2 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, encrypt passwords in storage and in transmission. (1.7 Identification & Authentication, IA-5(1))

3.3

13.1.38.3 The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce password minimum and maximum lifetime restrictions of [1 day minimum, 90 day maximum], and prohibit password reuse for [a minimum of 5 for unclassified information systems] generations. (1.7 Identification & Authentication, IA-5(1))

3.3

13.1.38.4 <Not allocated to system design; pertains to public users only>

13.1.39 The NARA Catalog system shall obscure feedback of authentication information during the authentication process to protect the information from possible exploitation/use by unauthorized individuals. (1.7 Identification & Authentication, IA-6)

3.3

13.1.40 The NARA Catalog system shall use mechanisms for authentication to a cryptographic module that meet the requirements of applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance for such authentication. (1.7 Identification & Authentication, IA-7) NARA Guidance: This requirement means that cryptographic modules used for identification and authentication must meet FIPS 140-2 standards.

3.3

13.1.41 The NARA Catalog system shall uniquely identify and authenticate non-NARA users (or processes acting on behalf of non-NARA users). (1.7 Identification & Authentication, IA-8)

3.3


13.1.42 The NARA Catalog system shall separate user functionality (including user interface services) from information system management functionality. (System and Communications Protection, SC-2) Supplemental Guidance: Information system management functionality includes, for example, functions necessary to administer databases, network components, workstations, or servers, and typically requires privileged user access. The separation of user functionality from information system management functionality is either physical or logical and is accomplished by using different computers, different central processing units, different instances of the operating system, different network addresses, combinations of these methods, or other methods as appropriate. An example of this type of separation is observed in web administrative interfaces that use separate authentication methods for users of any other information system resources. This may include isolating the administrative interface on a different domain and with additional access controls.

3.3

13.1.43 The NARA Catalog system shall prevent unauthorized and unintended information transfer via shared system resources. (System and Communications Protection, SC-4) Supplemental Guidance: The purpose of this control is to prevent information, including encrypted representations of information, produced by the actions of a prior user/role (or the actions of a process acting on behalf of a prior user/role) from being available to any current user/role (or current process) that obtains access to a shared system resource (e.g., registers, main memory, secondary storage) after that resource has been released back to the information system. Control of information in shared resources is also referred to as object reuse. This control does not address: (i) information remanence which refers to residual representation of data that has been in some way nominally erased or removed; (ii) covert channels where shared resources are manipulated to achieve a violation of information flow restrictions; or (iii) components in the information system for which there is only a single user/role.

3.3

13.1.44 The NARA Catalog system shall monitor and control communications at the external boundary of the system and at key internal boundaries within the system. (System and Communications Protection, SC-7a)

2.1.4

13.1.44.1 The NARA Catalog system shall connect to external networks or information systems only through managed interfaces consisting of boundary protection devices arranged in accordance with the NARA security architecture. (System and Communications Protection, SC-7b)

2.1.4

13.1.44.2 The NARA Catalog system shall configure external firewalls to permit only the minimum protocols through that are required for the system to function. (System and Communications Protection, SC-7)

2.1.4

13.1.44.3 The NARA Catalog system shall configure external firewalls to ignore external ICMP 'echo' requests to the system. (System and Communications Protection, SC-7)

2.1.4


13.1.44.4 The NARA Catalog system shall configure external firewalls to ignore external UDP 'chargen' requests to the system. (System and Communications Protection, SC-7)

2.1.4

13.1.45 The NARA Catalog system shall protect the integrity of transmitted information. (System and Communications Protection, SC-8)

3.7, 4.2.3

13.1.46 The NARA Catalog system shall employ cryptographic mechanisms to recognize changes to information during transmission. (System and Communications Protection, SC-8 (1))

3.7, 4.2.3

13.1.47 The NARA Catalog system shall protect the confidentiality of transmitted information. (System and Communications Protection, SC-9)

3.7, 4.2.3

13.1.48 The NARA Catalog system shall employ cryptographic mechanisms to prevent unauthorized disclosure of information during transmission. (System and Communications Protection, SC-9(1))

3.7, 4.2.3

13.1.49 The NARA Catalog system shall terminate the network connection associated with a communications session at the end of the session or after no more than 30 minutes of inactivity for a backend user. (System and Communications Protection, SC-10) SC-10 Guidance: Long running batch jobs and other necessary operations are not subject to this time limit.

2.1.4

13.1.50 <Not allocated to system design; pertains to public users only>

13.1.51 The NARA Catalog system shall implement required cryptographic protections using cryptographic modules that comply with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance. (System and Communications Protection, SC-13) NARA Guidance: This requirement means that any cryptographic modules used must meet FIPS 140-2 standards.

3.7, 4.2.3

13.1.52 The NARA Catalog system shall protect the integrity and availability of publicly available information and applications. (System and Communications Protection, SC-14)

4

13.1.53 The NARA Catalog system shall prohibit remote activation of collaborative computing devices (if collaborative computing mechanisms are used). (System and Communications Protection, SC-15)

3

13.1.54 The NARA Catalog system shall provide mechanisms to protect the authenticity of communications sessions. (System and Communications Protection, SC-23)

3.7, 4.2.3

13.1.55 The NARA Catalog system shall protect the confidentiality and integrity of information at rest. (System and Communications Protection, SC-28)

4, 5.1

13.1.56 <Allocated to NARA Catalog Application Server Design>

13.1.57 <Allocated to NARA Catalog Application Server Design>
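As an illustration of how the password complexity rule in requirement 13.1.38 could be checked, the following is a minimal sketch and not the Catalog's actual implementation. It assumes that "8-character" is a minimum length and that "special characters" means ASCII punctuation; the function name `meets_complexity` is hypothetical.

```python
import string

MIN_LENGTH = 8  # assumption: "8-character" is treated as a minimum length


def meets_complexity(password: str) -> bool:
    """Return True if the password is long enough and contains at least one
    upper case letter, lower case letter, digit, and special character."""
    if len(password) < MIN_LENGTH:
        return False
    required_classes = (
        string.ascii_uppercase,
        string.ascii_lowercase,
        string.digits,
        string.punctuation,  # assumption: "special characters" = ASCII punctuation
    )
    # Every character class must be represented at least once.
    return all(any(ch in cls for ch in password) for cls in required_classes)
```

The comparison is case sensitive by construction, since upper and lower case letters are counted as separate classes.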


2 Hardware and Network Design

This section covers the anticipated hardware and network required to meet the initial NARA Catalog production system requirements.

2.1 Production System

2.1.1 Assumptions

This production system is scaled to meet the following stated requirements:

500 million digital objects

30 million records (20 million descriptions, 10 million authorities)

2000 sustained concurrent users, 20,000 peak

Further, we assume that every digital object is a separate digital object file as specified in the current system with an <object> tag in the archival description.

See section 2.5 for an example of how to compute 2014 and 2015 NARA Catalog Prod requirements for smaller volumes or a mixture of different types of index entries.

2.1.2 Server Hardware

The following diagram shows all of the hardware servers and networks anticipated for the NARA Catalog Production system.


[Diagram: OPA Production. Components: Search Engine Array, 25 + 25 (failover & QPS); Content Processing & FTP, 2 (capacity); Database, 1 + 1 (failover); Application Servers, 3 (capacity) + 1 (failover) + 1 (bulk exports); Reporting, Monitoring & Admin Control, 1 + 1 (failover); OPA Storage. External connections: Internet, DAS, FTP Client.]

All standard server machines are expected to be modern machines with the following minimum characteristics:

2.5 MB processor cache per core

3 GHz CPU clock speed or better

SAS (Serial Attached SCSI) Hard Drives

o Note: SATA drives will not provide sufficient I/O bandwidth capacity for NARA Catalog applications.

o SAN storage is also a viable option, as long as it delivers sufficient I/O operations per second

RAID 1, RAID 5, or RAID 10 for all hard drives

Disks for servers must not be shared

o The architecture is designed to be a “share nothing” system

o The only shared storage is NARA Catalog Storage

o All hard disk drives on each machine must be dedicated spindles.

o This is especially critical for servers listed with IOPS of “high” below


Specific requirements on RAM and number of processing cores per server are identified below:

System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
database | primary | 1 | 122gb | 16 | 250gb | high | MySQL Server
database | failover | 1 | 122gb | 12 | 250gb | high | MySQL Server
Content Processing | primary | 2 | 30gb | 8 | 2tb | med | Content processing & SFTP server. Note that two servers provide capacity. Failure of one server will reduce ingestion capacity.
Search Engine | primary | 25 | 60.5gb | 16 | 1tb | high | Solr Search servers for 530 million medium-sized index entries divided 25 ways.
Search Engine | failover | 25 | 60.5gb | 16 | 1tb | high | Failover row, holds index replicas for primary row.
Web Application | primary | 4 | 30gb | 16 | 1tb | low | Holds application servers to handle API requests from end-user interfaces. 4 servers are recommended for load balancing and failover. Disk space is for holding 1yr of log data.
Bulk Export | primary | 1 | 30gb | 8 | 100gb | low | Server to process bulk exports in background. Output is written to NARA Catalog Storage.
Reporting, Monitoring, Server Control | primary | 1 | 30gb | 16 | 2tb | low | Holds the reporting application, Zookeeper server management, and system monitoring tools. May hold SFTP server as well. Disk space is for holding 2yrs of log data for reporting functions.
Reporting, Monitoring, Server Control | failover | 1 | 30gb | 16 | 2tb | low | Failover server for admin functions. Disk space is for holding 2yrs of log data for reporting functions.

Note:

“Cnt” is the number of servers for the specified configuration

RAM, Cores, and Hard Disk are “per server” values.

Database recommendations come from this sizing guide.


Hard Disk Drive Provisioning

The hard disk numbers above identify different IOPS (I/O Operations Per Second):

high – 1000-2000 IOPS

medium – 250-500 IOPS

low – 100-250 IOPS

2.1.2.1 Disk Space for Database Servers

The following spreadsheet provides a very rough estimate of the disk space required for the users and annotations table.

Notes:

Requirements estimate 1,000,000 users. Estimates for number of transcriptions, translations, tags, etc, are based on this estimate.

Number of bytes of data per row and for indexes are estimated based on current table designs

A multiplier of 4x is provided for expansion in the MySQL InnoDB database structure.

Table | Data (bytes per row) | Indexes (bytes per row) | Count | Total (gb)
transcriptions | 443 | 100 | 1,000,000 | 2.02
translations | 443 | 100 | 500,000 | 1.01
tags | 148 | 50 | 10,000,000 | 7.36
comments | 328 | 50 | 2,000,000 | 2.81
annotations log | 563 | 200 | 15,600,000 | 44.25
accounts | 274 | 100 | 1,000,000 | 1.39
Total | | | | 58.85

Since these estimates are very rough, a total disk space of 250gb per server is recommended to provide a 5x buffer for growth in requirements or mis-calculations.
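The arithmetic behind the estimate can be sketched as follows. This is an illustrative reconstruction, not the original spreadsheet: it assumes each total is (data bytes + index bytes) * row count * the 4x expansion multiplier, and that "gb" means 2^30 bytes. Under those assumptions the results come out close to the table's figures, though not digit-for-digit identical.

```python
# Rows from the table: (table name, data bytes/row, index bytes/row, row count)
TABLES = [
    ("transcriptions", 443, 100, 1_000_000),
    ("translations", 443, 100, 500_000),
    ("tags", 148, 50, 10_000_000),
    ("comments", 328, 50, 2_000_000),
    ("annotations log", 563, 200, 15_600_000),
    ("accounts", 274, 100, 1_000_000),
]

EXPANSION = 4             # 4x expansion multiplier stated in the notes
BYTES_PER_GB = 1024 ** 3  # assumption: "gb" means 2**30 bytes


def table_size_gb(data_bytes: int, index_bytes: int, rows: int) -> float:
    """Estimated on-disk size of one table, in gb."""
    return (data_bytes + index_bytes) * rows * EXPANSION / BYTES_PER_GB


def total_size_gb(tables=TABLES) -> float:
    """Estimated on-disk size of all tables, in gb."""
    return sum(table_size_gb(d, i, n) for _, d, i, n in tables)


if __name__ == "__main__":
    for name, d, i, n in TABLES:
        print(f"{name:16s} {table_size_gb(d, i, n):6.2f} gb")
    print(f"{'Total':16s} {total_size_gb():6.2f} gb")
```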

2.1.2.2 Disk Space for Content Processing Servers

The content processing servers will maintain local cache files for the following:

All original DAS XML files

o Total number of DAS XML files expected (initial configuration): 20 million (descriptions) + 10 million (authority records) = 30 million total

o Average DAS XML size: ~10kb


o Total required disk space: 10,000 * 30,000,000 = 300gb

A copy of all ARC XML files

o Total number of ARC XML files expected (initial configuration): 20 million (descriptions) + 10 million (authority records) = 30 million total

o Average ARC XML size: ~10kb

o Total required disk space: 10,000 * 30,000,000 = 300gb

Database cache of parent records and counts

o Parent records: 20,000,000 * 0.25 = 5 million

About 25% of DAS descriptions are a parent record

o Authority Records: 10 million

o Total records in the database cache: 10 million + 5 million = 15 million

o Expected bytes per record: 1000

o Total disk space required: 15,000,000 * 1,000 = 15gb

Total disk space estimated: 300gb + 300gb + 15gb = 615gb

Recommended disk space: 2tb to account for expansion of requirements and unexpected growth
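The estimate above can be reproduced with a short calculation. The sketch below is illustrative only; the constant and function names are hypothetical, and it assumes decimal gigabytes (10^9 bytes), which is what the 10,000 * 30,000,000 = 300gb figure implies.

```python
DESCRIPTIONS = 20_000_000       # archival descriptions
AUTHORITIES = 10_000_000        # authority records
AVG_XML_BYTES = 10_000          # ~10kb per DAS or ARC XML file
PARENT_FRACTION = 0.25          # about 25% of descriptions are parent records
CACHE_BYTES_PER_RECORD = 1_000  # expected bytes per cached record
GB = 10 ** 9                    # assumption: decimal gigabytes


def content_processing_disk_gb():
    """Return (DAS XML, ARC XML, database cache) disk estimates in gb."""
    records = DESCRIPTIONS + AUTHORITIES
    das_xml = records * AVG_XML_BYTES / GB
    arc_xml = records * AVG_XML_BYTES / GB
    parents = int(DESCRIPTIONS * PARENT_FRACTION)
    cache = (parents + AUTHORITIES) * CACHE_BYTES_PER_RECORD / GB
    return das_xml, arc_xml, cache


if __name__ == "__main__":
    das, arc, cache = content_processing_disk_gb()
    print(f"DAS XML: {das:.0f}gb, ARC XML: {arc:.0f}gb, cache: {cache:.0f}gb")
    print(f"Total: {das + arc + cache:.0f}gb")
```

The 2tb recommendation is roughly three times this total.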

2.1.2.3 Search Engine Sizing

Size per Index Entry

Each index entry is relatively verbose:

Entire ARC XML description = 10K bytes / entry

Technical metadata for each digital object = 1K bytes / entry

Extracted text content = 4K bytes / entry

o Note that even though most entries are images (with no extracted text content), PDF files are typically provided which contain OCR text for all of the images.

o Therefore, the average of 4K bytes / entry holds

Additional metadata fields: 2K bytes / entry

Total size: 17K bytes / entry

Note: These index sizes are substantially larger than the current OPA Pilot system because the full XML description and full object metadata XML is indexed with every index-entry, as is required to handle the API use cases discussed with NARA.


Index Entries / Node

The general consensus for search engine index sizing is:

Small Documents (1K-5K / entry): 50 million index entries / node

Medium Documents (5K-50K / entry): 25 million index entries / node

Large Documents (50K+ / entry): 10 million index entries / node

Since each NARA Catalog index entry is about 17K bytes (a medium document), Search Technologies recommends sizing each node at around 25 million index entries.

Total number of index entries

Total number of records to be indexed into NARA Catalog:

20 million archival descriptions

10 million authority records

500 million digital objects

Total: 530 million index entries

Total Number of Servers

Based on the above estimates, the total number of servers recommended will be:

530,000,000 / 25,000,000 = 21.2 servers

Rounding up = 25 servers

Two replicas for query performance and scalability

Total servers: 25 * 2 = 50 servers

Index Space

The storage required for each search engine server is computed as follows:

Total data content = 530 million entries * 17K bytes / entry = 9.1tb

Index content = 9.1tb (same as content size)

Total disk required: 18.2tb

Round up: 20tb

Disk per server: 20tb / 25 servers = 800gb / server

Recommended disk space per server: 1tb / server
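The server count and per-server disk figures can be checked with a short calculation. This is a sketch under stated assumptions, not a sizing tool: it treats "17K bytes" as 17,000 bytes and uses decimal terabytes, so the intermediate values come out slightly below the document's rounded figures (9.1tb, 20tb, 800gb). All names are hypothetical.

```python
import math

# 20M descriptions + 10M authority records + 500M digital objects
INDEX_ENTRIES = 20_000_000 + 10_000_000 + 500_000_000
ENTRIES_PER_NODE = 25_000_000  # medium-sized documents
BYTES_PER_ENTRY = 17_000       # assumption: "17K bytes" = 17,000 bytes
PLANNED_PARTITIONS = 25        # value the text rounds up to
REPLICA_ROWS = 2               # primary row plus replica row


def minimum_partitions() -> int:
    """Smallest number of nodes that keeps each at or under 25M entries."""
    return math.ceil(INDEX_ENTRIES / ENTRIES_PER_NODE)


def total_servers() -> int:
    """Planned partitions multiplied by the number of replica rows."""
    return PLANNED_PARTITIONS * REPLICA_ROWS


def disk_per_server_gb() -> float:
    """Content plus an index of equal size, spread over the partitions."""
    content_tb = INDEX_ENTRIES * BYTES_PER_ENTRY / 1e12
    total_tb = 2 * content_tb  # index assumed to be the same size as content
    return total_tb * 1000 / PLANNED_PARTITIONS


if __name__ == "__main__":
    print("minimum partitions:", minimum_partitions())
    print("total servers:", total_servers())
    print(f"disk per server: {disk_per_server_gb():.0f}gb")
```

Without the intermediate rounding, the per-server figure stays under the 800gb computed above and well within the recommended 1tb.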


2.1.2.4 Disk Space for Application Servers and Reporting Servers

In order to provide the reports required by the NARA Catalog reporting requirements, every API access will be recorded in log files.

An estimate of API accesses includes:

60 queries / second “sustained” usage (Rqmt 10.4)

“Normal traffic” of 2,000 concurrent users (Rqmt 10.6)

o Assuming API calls of 1 call per every 10 seconds per user provides 200 API calls / second

Total: 260 API calls / second

Assuming each API call requires about 100 bytes

Disk Usage: 26,000 bytes / second = 2.1gb / day of logs generated

Hold 1 year worth of logs: 2.1gb * 365 = 766.5gb of logs = 1tb of disk space (rounded up)
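The log volume estimate can be reproduced as follows. This is an illustrative sketch with hypothetical names; it assumes "gb" means 2^30 bytes, which is the reading under which 26,000 bytes / second works out to about 2.1gb / day.

```python
QUERIES_PER_SECOND = 60     # sustained query rate (Rqmt 10.4)
CONCURRENT_USERS = 2_000    # normal traffic (Rqmt 10.6)
SECONDS_BETWEEN_CALLS = 10  # one API call per user every 10 seconds
BYTES_PER_CALL = 100        # log bytes written per API call
SECONDS_PER_DAY = 86_400
BYTES_PER_GB = 1024 ** 3    # assumption: "gb" means 2**30 bytes


def api_calls_per_second() -> float:
    """Sustained queries plus calls generated by concurrent users."""
    return QUERIES_PER_SECOND + CONCURRENT_USERS / SECONDS_BETWEEN_CALLS


def log_gb_per_day() -> float:
    """Log volume written per day, in gb."""
    bytes_per_second = api_calls_per_second() * BYTES_PER_CALL
    return bytes_per_second * SECONDS_PER_DAY / BYTES_PER_GB


def log_gb_per_year() -> float:
    """Log volume written over 365 days, in gb."""
    return log_gb_per_day() * 365


if __name__ == "__main__":
    print(f"{api_calls_per_second():.0f} calls/s")
    print(f"{log_gb_per_day():.2f} gb/day, {log_gb_per_year():.0f} gb/year")
```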

2.1.3 NARA Catalog Storage Hardware

NARA Catalog Storage requirements can be computed in several ways:

Requirement 12.1.1: 10,000tb of storage (10 petabytes)

Compute space for 500 million digital objects (requirement 12.1.2.1):

o Current space = 8.4tb for 1.6 million digital objects

o Scaling up: 8.4 * 500 / 1.6 = 2,625 tb (2.6 petabytes)

Current space for all images + digitization partner images: ~85tb
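The scaling estimate for 500 million digital objects is a simple linear extrapolation, sketched below with hypothetical names. It assumes the average size per digital object stays the same as in the current 1.6 million objects.

```python
CURRENT_TB = 8.4               # current space used
CURRENT_OBJECTS = 1_600_000    # digital objects currently stored
TARGET_OBJECTS = 500_000_000   # requirement 12.1.2.1
REQUIRED_TB = 10_000           # requirement 12.1.1 (10 petabytes)


def projected_storage_tb(target_objects: int = TARGET_OBJECTS) -> float:
    """Linear extrapolation of storage from the current average object size."""
    tb_per_object = CURRENT_TB / CURRENT_OBJECTS
    return tb_per_object * target_objects


if __name__ == "__main__":
    projected = projected_storage_tb()
    print(f"projected: {projected:,.0f} tb")
    print(f"share of 10,000 tb requirement: {projected / REQUIRED_TB:.0%}")
```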

2.1.4 Network Hardware

Network hardware required includes:

Load balancer: Internet → Application Servers

o Load balance requests from the internet across 4 application servers

o The load balancer to the internet will provide boundary control to external systems (Rqmt 13.1.44, 13.1.44.1)

o It will be configured to only allow minimum protocols (Rqmt 13.1.44.2), namely HTTP and HTTPS to the application server.

o External ICMP ‘echo’ requests will be ignored (Rqmt 13.1.44.3), as will external UDP ‘chargen’ requests (Rqmt 13.1.44.4)


Router / Firewall

o For ingesting new content via SFTP (push)

o For registering new updates from DAS (pull)

o The router to NARANet will provide boundary control to NARANet systems (Rqmt 13.1.44, 13.1.44.1)

o It will be configured to only allow minimum protocols (Rqmt 13.1.44.2), namely HTTP / HTTPS to/from DAS, and SFTP to NARA Catalog storage.

o External ICMP ‘echo’ requests will be ignored (Rqmt 13.1.44.3), as will external UDP ‘chargen’ requests (Rqmt 13.1.44.4)

2.1.4.1 Network Layout

The recommended network layout is shown in the following diagram:

[Diagram: recommended network layout within the AWS network. Public Sub-Net: Application Servers. Private Sub-Net A: Search Servers, Database Servers, OPA Storage. Private Sub-Net B: Content Processing, Reporting and Admin Control. Also shown: internet gateways, custom route tables, a router, DAS, and the FTP Client.]

With the above architecture, routes are carefully controlled to provide as much isolation from internet traffic as possible. In the above diagram, arrows represent “allowed inbound traffic”. Arrows for SSH for system administration are not shown.

The specific routes and network-security required will include:

1. Internet traffic → application servers (HTTP / HTTPS).

2. Application servers → Private Sub-Net A:

a. Access to NARA Catalog Database Servers (read/write)

b. Access to Search Servers (read only)

i. The search servers will be configured to limit application servers to the “/select” URL path.


c. Access to NARA Catalog Storage

i. Configured using NFS mount on the Application Servers.

ii. Application Servers → Digital Objects (read only)

iii. Application Servers → Bulk-Export Area (read/write)

3. Content Processing / Reporting & Admin Control → Search, Database, NARA Catalog Storage

a. The servers in private subnets A and B will be able to access each other as needed:

i. Database → Content Processing (read only)

ii. Content Processing → Search (write only)

iii. Content Processing → NARA Catalog Storage (read/write)

iv. Reporting & Admin Control → All Servers (read/write)

4. Servers in NARANet will need to be connected to select servers in NARA Catalog:

a. DAS → Content Processing (read only)

b. SFTP → NARA Catalog Storage (read/write to select directories)
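The access rules listed above can be expressed as a small allow-list, sketched below. This is an illustration only and not actual firewall, route table, or AWS configuration. The component names are hypothetical, and the entries for the database and DAS reflect one reading of the list (Content Processing reads from both), which is an assumption.

```python
# (client, resource) -> operations the client may perform on the resource
ALLOWED_ACCESS = {
    ("internet", "application"): {"http", "https"},
    ("application", "database"): {"read", "write"},
    ("application", "search"): {"read"},
    ("application", "storage/digital-objects"): {"read"},
    ("application", "storage/bulk-export"): {"read", "write"},
    ("content_processing", "database"): {"read"},  # assumed direction
    ("content_processing", "search"): {"write"},
    ("content_processing", "storage"): {"read", "write"},
    ("content_processing", "das"): {"read"},       # assumed direction
    ("sftp", "storage"): {"read", "write"},        # select directories only
}

ADMIN = "reporting_admin"  # Reporting & Admin Control: read/write to all servers
INTERNAL_SERVERS = {
    "application", "database", "search", "content_processing", "storage",
}


def is_allowed(client: str, resource: str, operation: str) -> bool:
    """Return True if the allow-list permits the operation."""
    if client == ADMIN:
        return resource in INTERNAL_SERVERS and operation in {"read", "write"}
    return operation in ALLOWED_ACCESS.get((client, resource), set())


if __name__ == "__main__":
    print(is_allowed("application", "search", "read"))   # permitted
    print(is_allowed("application", "search", "write"))  # denied
```

Anything not listed is denied by default, which matches the intent of isolating the private subnets from internet traffic.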

2.2 Sandbox Environment

The following is the system diagram for the sandbox system:

[Diagram: OPA Sandbox. Components: Search Engine Array (4), Content Processing (1), Application Servers (1), OPA Storage. External connections through routers: Internet, DAS, FTP Client.]

See section 2.1.2 above for details on the types of server machines required.

Specific requirements on RAM and number of processing cores per server for the sandbox environment are identified below:

System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
Content Processing | primary | 1 | 30gb | 8 | 2tb | med | Content processing & SFTP server.
Search Engine | primary | 4 | 60.5gb | 16 | 1tb | high | Solr Search servers for 100 million index entries.
Application | primary | 1 | 30gb | 16 | 100gb | low | Holds application servers to handle API requests from end-user interfaces.

The details of the sizing and disk space required per machine are the same as in section 2.1.

2.3 Development System

The system diagram for the development system is shown below:

[Figure: OPA Development system diagram. Labeled components: Search Engine Array, Content Processing, Database, Application Servers, load balancers, Reporting/Monitoring/Admin Control, OPA Storage, DAS, AWS Network, routers, and the Internet.]

Specific requirements on RAM and number of processing cores per server for the development environment are identified below:

System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
Database | primary | 1 | 122gb | 16 | 250gb | med | MySQL Server
Content Processing | primary | 1 | 30gb | 8 | 2tb | med | Content processing & SFTP server. Note that two servers provide capacity. Failure of one server will reduce ingestion capacity.
Search Engine | primary | 4 | 60.5gb | 16 | 1tb | med | Solr search servers. Can be configured either as 4x1 (four partitions, no failover) or 2x2 (two partitions with failover) as needs dictate.
Application | primary | 1 | 30gb | 16 | 500gb | low | Holds application servers to handle API requests from end-user interfaces.
Application | failover | 1 | 30gb | 16 | 500gb | low | Additional application server for testing session data persistence across multiple servers.
Reporting, Monitoring, Server Control | primary | 1 | 30gb | 8 | 1tb | low | Holds the reporting application, Zookeeper server management, and system monitoring tools. May hold SFTP server as well.

The details of the sizing and disk space required per machine are the same as in section 2.1.

2.4 UAT System

For each new release of NARA Catalog, a UAT system will be required. It is recommended that this system be substantially the same as the production system shown above in section 2.1.

If a true, elastically scalable cloud environment is available, Search Technologies recommends provisioning the UAT system only “as needed”, around major release dates. This is shown in the following diagram:

[Figure: UAT release timeline. Launch the NEW_UAT environment, start the UAT test, then release: NEW_UAT becomes PROD and the previous PROD becomes OLD_PROD. When PROD burn-in is complete, shut down OLD_PROD. Two systems are required only for a 3-month window around release.]

2.4.1 UAT to PROD Procedure

The recommended process for fielding a new UAT system is as follows:

1. Two months before “go live”, launch a new set of virtual machines (NEW_UAT).
   a. This system should be the same configuration as shown in section 2.1.

2. Deploy a completely new version of NARA Catalog to the NEW_UAT system.

3. Migrate data to NEW_UAT as needed.
   a. Restore backups to NEW_UAT.
   b. Reprocess updates since backup was made.
   c. This should *not* require a new copy of NARA Catalog Storage.
      i. Instead, NEW_UAT will operate on a “test packages” area.
      ii. Packages will be copied from the production area and modified as necessary.

4. Complete UAT test on NEW_UAT.

5. When the new system is ready to go live to production, perform a final system validation:
   a. Complete a final backup restore.
   b. Reprocess updates since backup was made.
   c. Complete a final system validation test.

6. Put NEW_UAT online.
   a. Requests to http://catalog.archives.gov now go to NEW_UAT.
   b. NEW_UAT now becomes PROD.
   c. PRODUCTION now becomes OLD_PROD.

7. Monitor and validate PROD to ensure smooth operation.

8. If there is a fatal problem with NEW_UAT:
   a. Restore: OLD_PROD becomes PROD again.
   b. Fix the problem.
   c. Return to step 4 above.

9. Once PROD (formerly NEW_UAT) is safe and running smoothly (past the burn-in period):
   a. Shut down OLD_PROD.
   b. Release the virtual machines back to the cloud.
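The go-live, rollback, and shutdown steps of this procedure amount to swapping environment roles. The following is a minimal Python sketch of that bookkeeping only. The function names, the `roles` mapping, and the cluster names are hypothetical; the actual switch would be a routing change whose mechanism this document does not specify.

```python
def go_live(roles):
    """Promote NEW_UAT to PROD; the previous PROD is kept as OLD_PROD."""
    return {"PROD": roles["NEW_UAT"], "OLD_PROD": roles["PROD"]}


def roll_back(roles):
    """Fatal problem after go-live: OLD_PROD becomes PROD again."""
    return {"PROD": roles["OLD_PROD"], "NEW_UAT": roles["PROD"]}


def finish_burn_in(roles):
    """Burn-in passed: OLD_PROD is shut down and its machines released."""
    return {"PROD": roles["PROD"]}


# Hypothetical cluster names, for illustration only.
before = {"PROD": "cluster-a", "NEW_UAT": "cluster-b"}
live = go_live(before)
```

Keeping OLD_PROD intact until the burn-in period has passed is what makes the rollback a simple role swap rather than a restore from backup.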

2.5 Example 2014 and 2015 NARA Catalog Prod Computations

This section covers an example of how 2014 and 2015 server requirements could be computed.

Note: This information is based on data sets known to Search Technologies which will need to be migrated into NARA Catalog Production in 2014 and 2015.

Naturally, Search Technologies is not aware of all of the potential data migrations into NARA Catalog Production which are planned for 2014. Therefore, this section should be viewed merely as an example of how server requirements could be scaled down should the 2014 and 2015 data volumes be less than specified in the NARA Catalog requirements spreadsheet.

2.5.1 Example Server Requirements

The current OPA Pilot system has the following characteristics:


Current digital objects: 1.6 million

Current archival descriptions: 8 million

Current authority records: 1.05 million

Expected growth for calendar year 2015:

EOP Packages: 360,000 (15,000 messages x 24 months)

Digital partner objects: 12 million

Based on the above information, Search Technologies believes that 50 servers for search may overestimate the requirements for NARA Catalog Production for Calendar years 2014 and 2015.

Based on the above requirements, the total number of index entries could rise to:

Total index entries (based on above estimates):

o Archival descriptions: 8 million + 25% = 10 million

o Authority records: 1.05 million + 25% = 1.35 million

o Digital objects: (1.6 million + 25%) + 12 million = 14 million

o EOP Packages: 360,000 x 2 (for description and digital object) = 0.72 million

o Total index entries: 26 million
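The estimate above can be reproduced with a few lines of Python. This is only a worked check of the arithmetic, using the figures quoted in this section; the variable names are illustrative. Computed without intermediate rounding, the sum comes to slightly more than the 26 million quoted above.

```python
GROWTH = 0.25  # the 25% growth allowance applied to existing records

archival_descriptions = 8_000_000 * (1 + GROWTH)
authority_records = 1_050_000 * (1 + GROWTH)
# Existing digital objects plus growth, plus the digital partner objects.
digital_objects = 1_600_000 * (1 + GROWTH) + 12_000_000
# Each EOP package yields one description and one digital object entry.
eop_entries = 360_000 * 2

total = (archival_descriptions + authority_records
         + digital_objects + eop_entries)
print(f"{total / 1e6:.2f} million index entries")
```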

To handle 26 million index entries, the following server counts could be modified:

Reduce search engines: from 50 to 4

Additional servers may be reduced depending on the rate of adoption of NARA Catalog Production:

How much will the APIs be used?

How many simultaneous users will NARA Catalog Production have in 2014 and 2015?

Current OPA Pilot usage is relatively light (0.1 QPS average, 7 QPS peak). Given historical usage, the Application servers may be reduced to 2 (from 4) and the bulk export server may be co-located with the “Reporting, Monitoring, and Server Control” Servers, leading to future reductions in hardware requirements for 2014 and 2015.

2.5.2 Elastic Scalability

Depending on the time it takes to provision new hardware (a function of the cloud environment), reducing hardware for 2014 and 2015 could be a relatively “safe” option, for the following reasons:

Adding search server rows for additional QPS will require about 2 weeks

o Machine instances can be created to launch servers quickly.


o Servers can be added as new “slave replicas” with simple configuration

o 2 weeks would be required for initial index replication, testing, and to account for possible roll-backs and re-attempts should something go wrong.

Adding additional index partitions for additional content will require about 6 weeks

o Machine instances can be created to launch servers quickly.

o Servers can be added as new “partitions” with simple configuration

o 6 weeks would be required to re-balance the documents across the partitions, which may require

Adding new application servers for additional end-user capacity will require about 3 days

o New application servers can be added at any time

o No complex data replications are required (they all share a master database)

o All API and UI transactions are stateless (state is carried in cookies and on the client)

Note that these times (2 weeks for a search row, 6 weeks for additional index partitions, 3 days for additional application servers) could be reduced with additional testing, scripting, and process documentation.

2.5.3 Unknowns

There are a number of unknowns in the calculations above which could cause the systems for 2014 and 2015 to be substantially larger. Specifically:

Will all of AAD be indexed as granules? 105 million records

o Note: Depending on API requirements, these could be “small” records.

o Many more “Small” records can be packed into a single server (as many as 50 million, instead of the 25 million recommended for “medium” records)

Will every name in the 1940 census be indexed as granules? 130 million records

o Note: Depending on API requirements, these could be “small” records. See above.

What other major initiatives will be required?

2.5.4 Computing Server Requirements for Index Entries of Varying Size

For index partition computations, a general understanding of the size of the index entry is required. For the purposes of NARA, documents can be classified as “small”, “medium”, and (possibly) “large”, as follows:


Small

o A single row from a database table

o A half-page of text

Medium

o Anything with <archival-description> XML is automatically medium or larger

o XML metadata for multiple objects

o One or two pages of text

Large

o Over 25 pages of text

Computations involving “small”, “medium”, and large index entries should be based on the following formula:

Index partitions = (number of small entries)/50 million + (number of medium entries)/25 million + (number of large entries)/10 million

Total servers = (index-partitions) * (replicas)

Currently we expect replicas = 2 to handle the QPS rates required by NARA Catalog.

Note: This formula only works if the small entries are truly small. For example, indexing the entire <archival-description> XML with every small entry will automatically turn all index entries to “medium” size.

For example, if all of AAD and all names in the 1940 census are indexed as “small” entries, then the following computations hold:

Medium entries (from previous sub-sections): 26 million

Small entries (from AAD & 1940 census): 234 million

Index partitions: (26 million/25 million) + (234 million / 50 million) = 5.72

o Round up: 6

Total search engine servers: 6 * 2 = 12
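The formula and the worked example above can be expressed as a short Python sketch. The function and constant names are illustrative, not part of the NARA Catalog code base; the capacities per partition and the default of two replicas are the values given in this section.

```python
from fractions import Fraction
from math import ceil

# Index entries one partition can hold, by entry size (from the formula above).
CAPACITY = {"small": 50_000_000, "medium": 25_000_000, "large": 10_000_000}


def index_partitions(small=0, medium=0, large=0):
    """Number of index partitions, rounded up to a whole partition."""
    load = (Fraction(small, CAPACITY["small"])
            + Fraction(medium, CAPACITY["medium"])
            + Fraction(large, CAPACITY["large"]))
    return ceil(load)


def search_servers(small=0, medium=0, large=0, replicas=2):
    """Total servers = index partitions * replicas."""
    return index_partitions(small, medium, large) * replicas


print(index_partitions(small=234_000_000, medium=26_000_000))  # 6
print(search_servers(small=234_000_000, medium=26_000_000))    # 12
print(search_servers(medium=26_000_000))                        # 4
```

The last call covers the 26 million medium entries alone and gives the 4 search servers proposed in section 2.5.1.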


3 Operating System Design

The operating system recommended for NARA Catalog is Red Hat Linux – or similar. Red Hat is on the NARA TRM as a recommended Linux variant.

3.1 Kernel Configuration

The kernel configuration will be as delivered. No kernel customizations are required.

3.2 Memory Configuration

The initial memory configuration will be based on default values for Linux.

Optimal kernel memory parameters (such as shmmax, file-max, swappiness, huge pages, etc.) will be determined based on search engine and MySQL performance tuning, as needed.

Any parameters modified to achieve the required performance will be documented in the administration guide.

Parameters required for parity checking (Rqmt 13.1.20.8) will be configured as well. [TBD – Requires help from NARA security team to determine correct parameters]

3.3 Accounts

Unix accounts will be managed as follows:

Guest accounts for COTS will be disabled. (Rqmt 13.1.1)

Separate accounts for server processes (Rqmt 13.1.7, 13.1.42)

Separate accounts for operating system account management (Rqmt 13.1.7, 13.1.42)

Login attempts will be limited to a maximum of 5 consecutive invalid attempts by a user during a 15 minute period (Rqmt 13.1.8)

o The NARA Catalog system shall automatically lock the account/node for at least 15 minutes when the maximum number of unsuccessful attempts is exceeded. (Rqmt 13.1.8.1)

The NARA Catalog system shall display an approved system use notification message before granting access. (Rqmts 13.1.9, 13.1.9.1, 13.1.9.2, 13.1.9.3, 13.1.9.4, 13.1.9.5, 13.1.9.6, 13.1.9.6.1)

o [TBD – need “approved system use notification message” from NARA for Linux accounts]


Enforce minimum password rules, including:

o Case sensitive, 8-character mix of upper case letters, lower case letters, numbers and special characters including at least one of each (Rqmt 13.1.38)

o Enforce at least a four character change when new passwords are created (Rqmt 13.1.38.1)

o Require password encryption (means requiring SSH for system access) (Rqmt 13.1.38.2)

o Enforce password minimum and maximum lifetime restrictions (1 day minimum, 90 day maximum) and prohibit password reuse for a minimum of 5 generations (Rqmt 13.1.38.3)

o Shall use FIPS 140-2 standards for cryptographic modules (Rqmt 13.1.40)
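The login and password rules above can be illustrated with a small Python sketch. On the servers these rules would be enforced through operating system configuration rather than application code, so this only makes the rules concrete. The function names are hypothetical, and two interpretations are assumptions: a “four character change” is taken to mean that at least four characters of the new password do not occur in the old one, and the account is locked as soon as the fifth consecutive failure falls within the 15 minute period.

```python
import string

MAX_FAILURES = 5          # consecutive invalid attempts (Rqmt 13.1.8)
WINDOW_SECONDS = 15 * 60  # period in which the attempts are counted
LOCK_SECONDS = 15 * 60    # minimum lock duration (Rqmt 13.1.8.1)


def password_ok(new, old=""):
    """Check the composition rules of Rqmts 13.1.38 and 13.1.38.1."""
    has_all_classes = (
        any(c in string.ascii_uppercase for c in new)
        and any(c in string.ascii_lowercase for c in new)
        and any(c in string.digits for c in new)
        and any(c in string.punctuation for c in new)
    )
    # Assumption: count characters of the new password absent from the old.
    changed = sum(1 for c in new if c not in old)
    return len(new) >= 8 and has_all_classes and changed >= 4


def locked_until(failure_times):
    """Return the time the lock expires, or None if there is no lock.

    failure_times: ascending timestamps (in seconds) of the consecutive
    failed logins since the last successful one.
    """
    recent = failure_times[-MAX_FAILURES:]
    if (len(recent) == MAX_FAILURES
            and recent[-1] - recent[0] <= WINDOW_SECONDS):
        return recent[-1] + LOCK_SECONDS
    return None
```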

3.4 Auditing

Auditing will be done with the help of the NARA security team, and based on NARA Linux recommended configurations.

We anticipate that this will include:

Administrator auditing with the “psacct” package and perhaps other packages (such as rootsh logging) (Rqmts 13.1.10, 13.1.11)

Appropriate configuration of syslogd, including auditing of successful and unsuccessful account events (Rqmts 13.1.10, 13.1.12)

Verification of logs generated to /var/log/security and /var/log/audit/audit.log

Protection of logs from unauthorized modification (Rqmts 13.1.19, 13.1.19.1)

Capturing operating system errors (Rqmt 13.1.20.4)

[TBD – Details will require standard recommended configurations from NARA security team]

3.5 Ports Configuration

All ports will be initially “turned off” for all servers. Then ports will be individually turned on as required for inter-process and external communications to ensure that only the minimum number of ports are enabled. (Rqmt 13.1.22)

3.6 Clock Synchronization

The Linux systems will be configured to use their internal system clocks for audit timestamps, and to synchronize those clocks with NARA’s authoritative time source [TBD – what is NARA’s authoritative time source?]. (Rqmts 13.1.17, 13.1.18)


3.7 SSH

Only SSH (Secure Shell) will be allowed into NARA Catalog servers for system administration.

3.8 Maintaining and Patching the Operating System

The operating system will be maintained and patched using the cloud-recommended procedures.

This may involve:

Halting updates to the system.

o For example, turn off index updates by the ingestion servers.

o This will limit the amount of data which is changing. Most servers will now have “idle” systems with files that do not change.

Taking the server to be patched off-line.

Patching the operating system as necessary.

Bringing the server back on-line.

Re-synchronizing database files as necessary.

o If the ingestion servers are idle, then search engine indexes, application servers, and ingestion servers will not require re-synchronization.

o Therefore, only the RDBMS may still be receiving updates that require synchronization, when a database server is taken off-line, patched, and then brought back on-line.

All critical servers have fail-over siblings, which allows either one to be taken off-line for patching as necessary.


4 Storage Design

This section covers the design of “NARA Catalog Storage”.

This section does not cover the disk required for each individual server (see section 2.1 for more information about individual server disk).

4.1 Storage Technology for NARA Catalog Prod

The storage technology to be used for NARA Catalog Prod will need to be discussed with the cloud provider. The following are expected storage technologies based on the provider chosen.

4.1.1 Version 1

It is expected that the technology of NARA Catalog storage for NARA Catalog Prod, version 1 will be a simple, mounted disk drive.

The underlying technology will depend on the cloud environment chosen:

For FDC – this will be a NetApp disk

For Amazon Cloud – this will be Elastic Block Storage (EBS)

For other cloud systems – this will be standard mounted disk drives.

Note: If using a cloud system, an additional server may be required to service NFS requests from all other NARA Catalog server machines.

The exact storage mechanism will need to be discussed with the cloud provider – once this is determined by NARA.

Depending on the cloud environment and the storage options provided, additional tasks may be required to achieve the high IOPS required by the database and the search engine servers for NARA Catalog. This may include:

Striping the disks for higher I/O Performance

Having many separate volumes instead of a small number of very large volumes

Using different types of disks (i.e. “Provisioned” storage vs “Standard” storage)

The exact configuration and disk mounting steps will be determined once the cloud environment and storage technology are determined.


Server Access to NARA Catalog Storage

NARA Catalog Storage will be NFS mounted to all of the servers that require access to it. This includes:

Ingestion servers (read/write)

Application servers (read/write)

Reporting and server management servers (read only access)

Bulk export server (read/write)

For more fine-grained security controls, see below.

Note that the search engine servers and database servers will not require access to NARA Catalog Storage.

4.1.2 Version 2

Again, depending on the cloud provider, NARA Catalog will use a shared high-volume cloud storage technology, specifically, Amazon S3.

Amazon S3 has a better price per terabyte than mounted disk drives.

However, this will need to be deferred to version 2 for the following reasons:

Cloud storage providers are accessed through custom RESTful interfaces

o This requires additional programming for reading and writing every file to the storage system.

The performance metrics are unknown

o Additional benchmarking and performance testing will be required

Additional management and monitoring tools may be required


4.2 Structure

The structure of NARA Catalog storage is shown in the following diagram:

[Figure: NARA Catalog storage structure. Mount points / top-level directories: Environments (Dev / Test, Sandbox, Production, Backup), Digitization Partner / Future Projects, and Pre-Ingestion. Second-level directories: a directory per partner / future project and a directory per project / responsible entity. Below these: OPA-IP directories and SEIP directories, possibly compressed and bundled.]

The sub-directories are as follows:

/opa/
  bulk/                     Holds bulk-export files
    <export-files>          Currently holds bulk-export files. May be divided into multiple mount points later if the size / quantity of the bulk exports require it.
  dev/                      Holds content for the development environment
    <naid-directories>      See section 4.2.2
  future/                   Embargoed & digitization partner content (R-2.3.3.2)
    <project-directories>   Every project has a separate directory / mount
  pre/                      The pre-ingestion area, a holding area for new content
    updates/                Used to hold new digital objects to ingest
      quarantine            Holds quarantined update packages (R-2.10, R-2.11)
    eop/                    Every project has a separate directory / mount
      quarantine            Holds quarantined EOP packages (R-2.10, R-2.11)
    <other-projects>/       Every project has a separate directory / mount
      quarantine            Holds quarantined packages for the specified project (R-2.10, R-2.11)
  prod/                     Holds content for the production environment
    <naid-directories>      See section 4.2.2
  sandbox/                  Holds content for the sandbox environment
    <naid-directories>      See section 4.2.2

4.2.1 Project Directories

Project directories will be created as one per project.

For the “future” directory

o Project directories will be for different digitization projects / partners

o This may contain embargoed data

o Access controls will be based on the individual project needs:

Full access to the project manager and designees

- Access may be revoked once the project is “complete”

Full access to system administrators

No access to any NARA Catalog server process

For the “pre” directory

o “eop” – Holds new SEIPs from the EOP system

“quarantine” – packages which fail are copied to quarantine.

- The reason for the failure will be in the log files.

Access controls:

- Full access for systems and users which produce EOP SEIP packages

- Full access to system administrators

- Full access to NARA Catalog ingestion servers

- No access to other systems or individuals

o “updates” – Holds partial NARA Catalog-IP directories for updated digital objects.

Directories will be named with the description “naid”, and contain:

- objects.xml – An XML file describing the object ID, object information, and files for each object ID.

- content – A sub-directory holding the actual content files.

“quarantine” – Pre-ingestion packages which fail are copied to quarantine.

- The reason for the failure will be in the log files.

Access controls:


- Full access for systems and users which produce new digital objects

- Full access to system administrators

- Full access to NARA Catalog ingestion servers

- No access to other systems or individuals

o <other projects> - Other directories may be created as necessary to handle additional data flows for new projects.

For example, a new pre-ingestion project directory will be created for each digitization partner project

Access controls:

- Full access to the project owner

- Full access to system administrators

- Full access to NARA Catalog ingestion servers

- No access to other systems or individuals

What’s a Mount Point?

At this juncture, without knowing the cloud environment and without a definitive answer on the storage technology to be used, it is impossible to know what project directories will be separate mount points.

Tentatively:

Every project directory inside “future” will be a separate mount point

o It is expected that each of these directories will represent a significant amount of data.

o Currently, there are 64tb of data in the “future” directory in OPA Pilot.

The entire “pre” directory could be a single mount point.

o Since this is a transient directory, it is not expected to require much disk space.

o However, some of the <other-projects> may need to be separate mount points (TBD) depending on the amount of data and whether content is provided all at once, or in batches.

4.2.2 NAID Directories / Separate Environments

Dev/Test, Sandbox, Production, and Backup will all have NAID based directory structures. The NAID will be the same NAID for the object as specified in DAS for the associated archival description.

The directory structure will be as follows:

Level0: Environment (“dev”, “prod”, “sandbox”)


Level1: NAID mod (numLevel1Dirs)

o Each environment will have a configured number of “Level1” directories.

For the first production release this will be 10

Maximum anticipated is 100

o Depending on the type of storage, each top-level directory may be a different mount-point.

Level2: (NAID/(numLevel1Dirs)) mod 10000

o Each level2 directory will contain up to 10,000 sub-directories.

Level3: des-NAID

o The entire NARA Catalog-ID will be used as the directory name for the NARA Catalog Information Package.

For example, for the NAID 5541536, the following levels would apply:

Level 1: 36 (the last two digits, i.e. NAID mod 100)

Level 2: 5415 (the next four digits)

Level 3: des-5541536 (the entire NAID)

And the final directory would be: /opa/prod/36/5415/des-5541536

The purpose of using the lowest digits for level1 and level2 is to allow for a random distribution of files amongst those levels, so that a single directory path will be less likely to grow at a larger proportion than other directories.

Total size of the storage, assuming:

Level1 directories = 100

Level2 directories = 10,000

Level3 directories = 10,000

…is 10 billion packages (e.g. 10 billion descriptions).
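The directory computation and the capacity figure above can be checked with a short Python sketch. The function name and the default of 100 Level1 directories (chosen to reproduce the example) are illustrative. Whether directory names are zero-padded is not stated in this document, so the sketch does not pad them.

```python
LEVEL2_MODULUS = 10_000  # Level2 is (NAID / numLevel1Dirs) mod 10000


def naid_directory(naid, environment="prod", num_level1_dirs=100):
    """Build the storage path for the package of one archival description."""
    level1 = naid % num_level1_dirs
    level2 = (naid // num_level1_dirs) % LEVEL2_MODULUS
    return f"/opa/{environment}/{level1}/{level2}/des-{naid}"


print(naid_directory(5541536))  # /opa/prod/36/5415/des-5541536

# Capacity with 100 Level1, 10,000 Level2 and 10,000 Level3 directories:
capacity = 100 * 10_000 * 10_000
print(capacity)  # 10000000000, i.e. 10 billion packages
```

With the 10 Level1 directories planned for the first production release, the same NAID would map to /opa/prod/6/4153/des-5541536.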

Access Controls

The access controls for each of the different environments will be:


“dev”

o Full access – All NARA Catalog staff (developers, system administrators, testers, etc.)

o Full access – All development server accounts

“prod”

o Full access – Ingestion server accounts

o Full access – Application server accounts

This is required to write transcriptions and translations into NARA Catalog Storage.

This may be changed to read/write, depending on whether or not application servers need to create sub-directories inside of NARA Catalog-IPs [Design TBD]

o Read Access – Reporting, monitoring, and server management accounts

o Read Access – Bulk-export server account

o Full access – System administrators

“sandbox”

o Full access – Sandbox ingestion server accounts

o Read access – Sandbox application server accounts

o Full access – System administrators

4.2.3 SFTP Server Access

SFTP servers will be installed on the content processing servers. SFTP servers will have access to:

/opa/pre – The pre-ingestion area

/opa/future – The future projects / digitization partner / embargoed data area

Notes on SFTP configuration:

Anonymous SFTP access must be disabled.

All users will require an account on NARA Catalog in order to upload content via SFTP.

SFTP will only be available from NARANet.

Access to the /opa/pre and the /opa/future directories will be configured with operating system access control as described in the previous sections.


5 Backups & Recovery

This section covers backup and recovery methods.

5.1 Backups

Note that backups are required for the production environment only.

5.1.1 Backup Schedules

The following backup strategies will be required:

MySQL Databases

o Daily incremental backup

o Weekly full backup

Search engine index files

o Daily incremental backup

o Weekly full backup

Content Processing / Ingestion servers

o Weekly full backup of cache files

Application Servers

o Weekly copy of log files to the reporting servers.

Reporting, Monitoring, and Admin Control

o Weekly incremental backup of log files
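The schedule above can be written down as data, which makes it easy to see what is due on a given day. The following Python sketch is illustrative only: the names are hypothetical, the choice of Sunday for weekly jobs is an assumption, and the sketch assumes the daily incremental backup still runs on the day of the weekly full backup. None of this is specified in this document.

```python
import datetime

# (system, kind, frequency), taken from the list above.
SCHEDULE = [
    ("MySQL databases", "incremental", "daily"),
    ("MySQL databases", "full", "weekly"),
    ("Search engine index files", "incremental", "daily"),
    ("Search engine index files", "full", "weekly"),
    ("Content processing cache files", "full", "weekly"),
    ("Application server log files", "copy to reporting servers", "weekly"),
    ("Reporting, monitoring, and admin control log files",
     "incremental", "weekly"),
]

WEEKLY_WEEKDAY = 6  # assumption: weekly jobs run on Sunday (Monday is 0)


def jobs_due(day):
    """Return the (system, kind) backup jobs due on the given date."""
    return [
        (system, kind)
        for system, kind, frequency in SCHEDULE
        if frequency == "daily" or day.weekday() == WEEKLY_WEEKDAY
    ]
```

For example, `jobs_due(datetime.date(2015, 7, 8))` (a Wednesday) returns only the two daily incremental jobs, while a Sunday returns all seven entries.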

5.1.2 Backup Details

Details on the backup mechanism for each type of server will be outlined in the individual design documents:

For search engines: NARA Catalog Search Engine Design

For application servers: NARA Catalog Application Server Design

For the MySQL Database: NARA Catalog Application Server Design

For the Reporting, Monitoring and Admin Servers: NARA Catalog Reporting Design

For content processing: NARA Catalog Ingestion Design


Note that all backup scripts and processes will be fully documented in the Administrator Guide when the system is delivered.

Backups of COTS software and system configuration files will be done after every deployment. This will be documented in the Deployment Guide.

5.1.3 Backup Storage

Backup Storage will be implemented with Amazon AWS “Glacier” storage.

5.1.4 Backup for NARA Catalog Storage

In order to meet up-time and recovery requirements for site-wide disaster scenarios, all of NARA Catalog storage will need to be backed up. Further, the backup should be done off-site.

Due to the size of NARA Catalog storage, the backup method will depend on the cloud environment chosen for NARA Catalog. The method of backup will be determined after consultation with NARA and the cloud provider.

5.2 Recovery from Server Failure

The recovery method will depend on the type of server.

5.2.1 Database Servers

A primary and a failover exist for the MySQL database servers.

A failure of either server will mean operating with a single server until its sibling server can be restored and the database mirrored.

See the NARA Catalog Application Server Design for details on the database recovery process.

5.2.2 Content Processing / Ingestion Servers

Recovery times for content processing servers are longer (2 days) than for other servers (90 minutes). Therefore, the recovery process for content processing will be:

1. Launch a new virtual machine for the server

2. Deploy the appropriate software to the server

a. Steps 1 & 2 could be combined if a “machine image” of the server is saved as part of the deployment procedure.

b. Note that maintenance of machine images is not specified as a requirement.

3. Copy the appropriate backups to the server.

4. Reprocess updates since the last backup was saved.


See the NARA Catalog Ingestion Design for details on the ingestion server recovery process.

5.2.3 Search Engine Servers

A primary and a failover server exist for each search engine server.

Therefore, a failure of either server will mean operating with a single server for the specified index partition until its sibling server can be restored and the index copied.

See the NARA Catalog Search Engine Design for details on the search engine recovery process.

5.2.4 Application Servers

Application servers can be recovered at any time by simply launching a new instance of the server and adding it to the server farm. No recovery of backups is required.

See the NARA Catalog Application Server Design for details on the application server recovery process.

5.2.5 Reporting, Monitoring & Admin Control

A primary and a failover exist for the MySQL database servers.

A failure of either server will mean operating with a single server until its sibling server can be restored and the backups recovered.

See the NARA Catalog Reporting Design for details on the recovery process for reporting functions.

See the NARA Catalog Search Engine Design for details on the recovery process for Zookeeper.

5.3 Recovery from Site Failure

Recovery from site failure will require the following steps:

1. Launch new copies of all server instances.

2. Restore all databases from the latest backups.

3. Reprocess updates since the latest backups were made.

4. Re-index records as required.

It is conceivable that a complete re-index of all NARA Catalog content will be required to recover from a site-failure.

If this is the case, then multiple content ingestion servers may need to be launched to reprocess records in parallel, to perform a complete re-index within 7 days.
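The number of ingestion servers needed to meet the 7 day target follows from the record count and the per-server throughput. A minimal Python sketch, assuming a hypothetical throughput figure (this document does not state the ingestion rate of a single server):

```python
from math import ceil

REINDEX_DEADLINE_DAYS = 7


def ingestion_servers_needed(total_records, records_per_server_per_day):
    """Servers required to re-index all records within the deadline."""
    per_server = records_per_server_per_day * REINDEX_DEADLINE_DAYS
    return ceil(total_records / per_server)


# 26 million entries is the estimate from section 2.5.1. The throughput of
# 1 million records per server per day is a made-up figure for illustration.
print(ingestion_servers_needed(26_000_000, 1_000_000))  # 4
```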


6 System Monitoring

System monitoring will be performed using Amazon CloudWatch monitoring services.
