ONS Methodology Working Paper Series No 1 ONS Innovation Laboratories Owen Abbott December 2014


ONS Methodology Working Paper Series No 1

ONS Innovation Laboratories

Owen Abbott

December 2014


Contents

1. Introduction
2. History
2.1 The need for computing facilities to drive innovation
2.2 Openstack
Figure 1 – Openstack structure overview
2.3 Building the Openstack environment
Figure 2 – The Titchfield Lab Servers
Figure 3 – The Titchfield Lab Terminals
Figure 4 – The terminals in the Newport Lab
3. Lab architecture
4. Lab management and use
4.1 Management
4.2 Projects
4.3 Documentation and Information sharing
Figure 5 – Screenshot of Innovation Lab DokuWiki homepage
5. Conclusions
References
Annex A - Hardware list for the Titchfield Lab
Annex B - Openstack Environment Specification


ONS Innovation Laboratories

1. Introduction

This paper describes the Innovation Laboratories that are being used by the Office for National Statistics (ONS) to support its strategic aim to be at the forefront of integrating and exploiting data from multiple sources, and to support its aspiration to be perceived as among the best, most innovative statistical offices in the world (ONS, 2013). Big data are increasingly being used to produce statistics in the wider world, and ONS is engaging with emerging data, methods and tools. This requires access to an environment which can be used to explore these new developments and to analyse their potential for use in the production of official statistics.

The Innovation Labs are a resource for learning, research and innovation, providing technologies which are not available on the standard ONS secure network. They are stand-alone networks of high specification computers which are not connected to the ONS network, but have full internet access. They have been built as a full private cloud computing facility¹ which allows users control over processing power, operating system and software depending on their requirements. The cloud environment is built using open-source software.

The Labs were set up in response to the challenges of the ONS strategy (ONS, 2013). To deliver its aims, ONS recognised the need to:

explore how to gather and use big data;

explore the use of alternative data sources in producing statistics;

move towards more use of open-source software;

use technology to support the business; and

develop our skills and innovation abilities.

Another outcome from the ONS strategy was the establishment of the Big Data project (Naylor et al, 2014a) in late 2013 to:

investigate the potential advantages that big data provides for official statistics;

understand the challenges with using these sources;

establish an ONS policy on big data and longer term strategy incorporating ONS’s position within Government and internationally in this field; and

make recommendations on the best way to support the ONS strategy on big data beyond the life of the project.

A key component of the project was to include some practical applications of big data, both to assess the role they may have within official statistics and to help understand the methodological and technical issues that may arise when handling them. The Labs have supported these pilot projects, the Big Data team being the primary user group.

This paper provides a brief history of how the laboratories were instigated and built, and then sets out the IT architecture that was deployed. It summarises how the Labs are managed, and discusses some of the projects that have been undertaken using the lab environment. Lastly, the lessons learnt and future plans for the Labs are described.

1 A description of a private cloud can be found at http://www.interoute.com/cloud-article/what-private-cloud


2. History

2.1 The need for computing facilities to drive innovation

There has been a long-standing need within the ONS Methodology group for a “sandpit” environment which would allow the exploration of new statistical software, and would also provide a high performance computing capability. In the days when ONS had an in-house IT support function, it was possible to specify and obtain high performance workstations for particular purposes, such as simulating censuses, large scale spatial analyses or memory intensive multi-level modelling. However, with the move towards centralised IT functions and the outsourcing of IT support it became more difficult to repeat this approach, primarily because it was more expensive to introduce and support non-standard hardware.

We considered the possibility of getting a sandpit environment scoped and built by our IT partner organisation out of the box, but this was ruled out for three reasons:

Firstly, we did not have a clear picture of exactly what the environment would need to deliver. We therefore ran the risk of it not meeting our needs. The cost would also be very significant.

Secondly, changing and adapting the environment once delivered would be expensive and relatively slow, requiring a change management process that was not sufficiently agile for our needs.

Thirdly, a key objective was to build capability by understanding more about big data and data science tools and technologies. The former Methodology Group in particular has always fostered a strong culture of “learning by doing”, and it was felt that this would be the best way to build new capability.

In late 2011, the ONS Methodology group was merged with the ONS IT function. This reorganisation was designed to bring together statistical and IT experts in a single directorate to encourage collaboration between the two groups. As part of this, some statistical staff who were interested in computer science and technology began considering how to encourage collaboration and innovation.

Following discussions in March 2013, the Director agreed to the creation of a pilot innovation laboratory at the ONS Titchfield site. The equipment was purchased through the usual IT contract, albeit on a supply-only basis with no support and on the proviso that it was not to be connected to the secure ONS network. When the equipment arrived in November 2013, the methodology staff who had specified the machines installed them on a bank of spare desks. They networked the machines and connected them to a basic external broadband line which had been purchased. Windows 7 was installed on the machines to allow initial use of the lab facilities. This was followed by a pilot at the Newport site to mirror the Titchfield setup.

The next stage was to explore setting up the Labs as a private cloud environment. The methodology staff who instigated the hardware purchases had identified that Openstack² offered the best option in terms of cloud software, as it was widely supported and available across a number of platforms.

2.2 Openstack

Openstack is a collection of open source software packages that can be used to create a cloud environment on top of a set of physical hardware. The environment provides the ability to run virtual machines - essentially computers that are not tied to a specific physical machine. These virtual machines are fully customisable, so the end user can vary how powerful they are, how much storage they have, what operating system they use and what software is installed.

Figure 1 shows the basic Openstack structure. Users define what they need through the dashboard, and Openstack provisions compute (processing power), networking and disk storage for the user. The user then accesses the provisioned virtual machine either by logging on directly or, in the case of a machine that has a pre-built web interface, by using a web browser pointed at the machine’s network address.

2 http://www.openstack.org/

Figure 1 – Openstack structure overview³
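To make the provisioning step more concrete, the sketch below models a user's request and picks the smallest predefined machine size that satisfies it. It is a plain Python illustration and does not call the Openstack API. Openstack refers to such predefined sizes as "flavors", but the `Flavor` class, the catalogue names and all sizes shown here are hypothetical and are not taken from the Labs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Flavor:
    """A predefined virtual machine size (all values here are hypothetical)."""
    name: str
    vcpus: int
    ram_gb: int
    disk_gb: int


# Hypothetical catalogue; a real deployment defines its own sizes.
CATALOGUE = (
    Flavor("small", vcpus=1, ram_gb=2, disk_gb=20),
    Flavor("medium", vcpus=2, ram_gb=4, disk_gb=40),
    Flavor("large", vcpus=4, ram_gb=8, disk_gb=80),
    Flavor("xlarge", vcpus=8, ram_gb=16, disk_gb=160),
)


def smallest_fitting_flavor(
    vcpus: int,
    ram_gb: int,
    disk_gb: int,
    catalogue: Sequence[Flavor] = CATALOGUE,
) -> Optional[Flavor]:
    """Return the smallest flavor meeting every requirement, or None."""
    candidates = [
        f for f in catalogue
        if f.vcpus >= vcpus and f.ram_gb >= ram_gb and f.disk_gb >= disk_gb
    ]
    # "Smallest" is judged by vCPUs first, then RAM, then disk.
    return min(candidates, key=lambda f: (f.vcpus, f.ram_gb, f.disk_gb), default=None)


if __name__ == "__main__":
    # A user asking for 2 vCPUs, 3GB RAM and 30GB disk gets the "medium" size.
    print(smallest_fitting_flavor(vcpus=2, ram_gb=3, disk_gb=30))
```

A request that no size can satisfy returns `None`, which corresponds to the case where the physical hardware would need to be extended before the request could be met.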

This type of environment allows a user to essentially work on a virtual machine of their choosing (provided we have a base image for the operating system required). The user can install whatever software they like and run any processes they desire, as well as store and access data on a shared storage facility. What makes cloud environments particularly powerful is that they are easily scalable (by adding additional physical machines). An end user is able to run multiple virtual machines for a variety of purposes, and this also allows them to take advantage of parallel processing technologies (e.g. Apache Hadoop, a framework for the distributed processing of large data sets across clusters of machines). A key additional advantage to ONS of setting up a private cloud environment in-house is that those administering it will obtain a good understanding of the technology, its advantages and disadvantages.
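The parallel processing mentioned above can be illustrated with the map/shuffle/reduce pattern that Hadoop's MapReduce model is built on. The sketch below is a minimal single-process illustration in plain Python, not Hadoop code: each string in `chunks` stands in for a block of data that, on a cluster, would be processed by a separate machine. The function names and the sample text are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the values for each key; here, sum the counts."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: Iterable[str]) -> Dict[str, int]:
    """Run map on every chunk, then shuffle and reduce the results."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data big ideas", "open data open source"]
    print(word_count(chunks))
```

Because each call to `map_phase` depends only on its own chunk, the map step can be spread across as many virtual machines as are available, which is why adding machines to the cloud increases throughput for this kind of workload.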

2.3 Building the Openstack environment

The methodology group staff attempted to build the environment themselves in early 2014, learning about Linux and the open source software required along the way. There are many ways of implementing an Openstack environment, and a number of websites provide detailed instructions for doing so using open source software. After trying without success to use a package called Packstack⁴ to automatically provision Openstack, the team explored the Ubuntu⁵ distribution of Openstack, as this seemed to provide a more user friendly route that was relatively well documented. This had some success. The software used to control the hardware (MAAS⁶) was installed and deployed, as was the software orchestration layer (Juju⁷) which could deploy the Openstack components. However, some issues with the Openstack setup prevented an operational stack from being completed.

As some knowledge of the Ubuntu operating system (OS) had been obtained, and the quality of the software looked good, the decision was made to obtain some consultancy (through the government G-cloud procurement facility). This resulted in Canonical⁸ (the company which delivers Ubuntu) being engaged to provide 5 days of consultancy with the aim of having a working Openstack implementation by the end. This worked very well. The consultants collaborated with ONS methodology and IT staff to complete the installation and provide a full guide to setting up and using the environment. This was completed in Titchfield in March 2014, and in Newport in June 2014. The ONS team was able to use this guide to rebuild the environments from scratch, and has learnt a great deal from working through this process.

During the process of getting the Openstack environment working, the ONS team was able to acquire 20 ex-ONS workstation PCs which were due to be disposed of. These machines were of 2008 vintage, and were being replaced due to the ONS rollout of Windows 7. The team deployed the desktop version of Ubuntu 12.04LTS onto these PCs for use as terminals for users to log in and access Openstack through a web browser. Figures 2 and 3 show the Titchfield Lab (on a standard bank of ONS desks), and Figure 4 shows the Newport Lab (on specifically ordered desking).

3 See http://www.openstack.org/software/

4 https://wiki.openstack.org/wiki/Packstack

5 http://www.ubuntu.com

6 http://www.ubuntu.com/cloud/tools/maas

Figure 2 – The Titchfield Lab Servers

Figure 3 – The Titchfield Lab Terminals

7 http://www.ubuntu.com/cloud/tools/juju

8 http://www.canonical.com/


Figure 4 – The terminals in the Newport Lab

3. Lab architecture

The Labs are built using a mixture of high specification workstations (8 core processors, 64GB RAM) which act as servers, and ex-ONS desktops/laptops which act as the terminals through which users can either browse the web or access the Openstack environment. The Labs have their own broadband connection, network switches, printer and a storage server (with about 10TB storage) for the storage of data and user shares. The full list of hardware is provided at Annex A. In total, the equipment cost was in the region of £50k for each Lab.

The environment is classed as a Business Impact Level Zero (BIL0) environment, as it has minimal security and is not accredited in any way. This essentially means that it can only hold publicly available data, so unpublished ONS data cannot be held in the Labs. Higher BIL environments require increased security, protective monitoring and ultimately isolation from external attacks.


The servers are running Ubuntu 12.04LTS (a flavour of Linux) as the operating system, and they host the Ubuntu version of Openstack (we are using the Havana release). This provides about 90 virtual CPUs, 700GB virtual RAM and 32TB of storage. The storage backend is provided by a distributed storage solution called Ceph, which distributes the data across physical disks and hosts such that if one fails, there is always a copy on another disk or host. The full specification of the Openstack environment is provided at Annex B.

The desktop terminals use the desktop version of Ubuntu 12.04LTS (which provides a desktop environment and web browser). The Labs have a user authentication server which allows users to log on to any of the terminals and provides them with a home share (which is stored permanently on the storage server) where they can store their own files. The storage server hosts the Lab wiki (see section 4) and also provides a Virtual Private Network (VPN) server, which has allowed ONS staff in London to access the Labs from a standalone terminal there over the internet. It also allows the Lab administrators to connect to Titchfield and Newport for maintenance.

All software used to build the Labs is open source. This is software that is distributed under a license which allows anyone to view, modify or redistribute it for any purpose. Such software has often been built by public developer communities. This means it is essentially free, but it develops according to the level of interest in it and has limited professional support. The following software packages beyond those noted in the installation description have been used in the Labs: Mozilla Firefox (web browser), Google Chrome, R (statistical software), LibreOffice (an open source alternative to Microsoft Office), QGIS, MySQL, PostgreSQL, DokuWiki, Python, Java and more.

The operating systems that are available to be used in Openstack are Ubuntu server, Mint desktop, CentOS, CirrOS, Windows 2012 Server and Windows 7 desktop. Windows 7 is the only OS that requires a proper license, so the licenses that were purchased with the workstations on which the Labs are built are used.
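The property described in this section for the Ceph storage backend, that a copy of the data survives the loss of any one host, can be illustrated with a toy placement function. The sketch below is not Ceph's actual placement algorithm (Ceph uses its own method, CRUSH); it simply ranks hosts by a hash of the object name and keeps the top few. The host names, the object names and the choice of two replicas are all assumptions made for this example.

```python
import hashlib
from typing import Dict, List, Sequence


def place_replicas(object_name: str, hosts: Sequence[str], replicas: int = 2) -> List[str]:
    """Choose `replicas` distinct hosts for an object, deterministically."""
    if replicas > len(hosts):
        raise ValueError("need at least as many hosts as replicas")

    def score(host: str) -> str:
        # Hash of (object, host): a stable pseudo-random ranking per object.
        return hashlib.sha256(f"{object_name}:{host}".encode()).hexdigest()

    return sorted(hosts, key=score)[:replicas]


def surviving_copies(placement: Dict[str, List[str]], failed_host: str) -> Dict[str, List[str]]:
    """Return, for each object, the hosts that still hold a copy after a failure."""
    return {
        obj: [host for host in holders if host != failed_host]
        for obj, holders in placement.items()
    }


if __name__ == "__main__":
    hosts = ["storage-1", "storage-2", "storage-3"]  # hypothetical host names
    objects = ["census.csv", "prescriptions.csv", "tweets.json"]  # hypothetical
    placement = {obj: place_replicas(obj, hosts) for obj in objects}
    for failed in hosts:
        remaining = surviving_copies(placement, failed)
        # Every object keeps at least one copy after any single host failure.
        print(failed, all(len(copies) >= 1 for copies in remaining.values()))
```

Because each object is placed on two different hosts, removing any single host always leaves at least one copy, which is the guarantee the text describes. The cost is that usable capacity is lower than raw disk capacity.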

4. Lab management and use

4.1 Management

The Labs are supported by two Lab Managers, one for each site. IT support is provided by two system architects. None of these roles is full time (the Managers are 0.2 FTEs and the architects are 0.1 FTEs), and they were filled by interested parties who wanted to be involved in learning about the Lab technology and uses. There is also a small amount of administrative support for the Labs.

The Labs are overseen by the Innovation Labs Oversight Group, chaired by the Big Data project lead. The group includes both Lab Managers and representatives from the methodology and IT divisions. This group meets monthly to discuss relevant developments, approve project proposals, monitor the Lab setup and communicate project outcomes.

There is limited funding for the Labs. Whilst the Big Data project provided funding for the Lab setup, there is no specific funding for the projects carried out in the Labs. Staff wishing to use the Labs must therefore either be doing so as part of their funded work, or as part of self development. This has made it harder to encourage Lab use, particularly from areas which are mainly customer funded and therefore do not have spare capacity (or money) to spend time on innovation projects. The oversight group strategy is to deliver some pilot projects which will demonstrate the value of innovation, and therefore encourage budget holders to consider allowing resource for such projects.


4.2 Projects

At the time of writing, only Methodology and IT staff have been able to use the Labs, with plans to open them to all of ONS in early 2015. In order to ensure that the Labs are used appropriately, potential users have to complete a one page template outlining the purpose of the work they wish to carry out in the lab, what the benefits are, how long it will take and what their requirements are. Following Line Manager support, these projects are then submitted to the oversight group for approval before the users are provided with an account. For small projects that will require less than 2 hours or so (for instance viewing a webinar which cannot be accessed on the usual ONS network facilities), a project template is not required and a guest account can be used following approval from a Lab Manager.

Since the Labs opened in early 2014, their use has increased over time. The Big Data project is the primary user, with approximately 5 staff in total using the Labs across both sites on an almost permanent basis. Progress on Big Data projects was reported by Naylor et al (2014a and 2014b). Beyond the Big Data specific pilot project, there have been 7 completed projects and at the time of writing there are 6 ongoing projects. In addition, there have been around 10 users who have used the Labs for individual learning, accessing webinars or for exploring websites not accessible over the ONS network. The more substantial projects included:

Obtaining, evaluating and analysing GP prescriptions data (open data available monthly for all of England) to see if there are correlations with levels of health from 2011 Census

Obtaining and evaluating house transaction data (address level open data available monthly from Land Registry) to evaluate its coverage and see if there are correlations with accommodation type information from 2011 Census at low geographical levels

Evaluating open source GIS software (QGIS and uDig)

Setting up a multi-node Hadoop cluster and interface

Implementing a Windows 7 Desktop virtual machine for use in Openstack

Investigating how to carry out sentiment analysis using Twitter data in R, and whether it can help make predictions

Ongoing projects include:

Generating and damaging synthetic data using FEBRL (a record linkage tool)

Sentiment analysis of ONS outputs using Twitter data

Evaluating ArcGIS Pro

Implementing an extended Hadoop environment

Analysis of large volume ONS flexible working data

Assessing the potential of Google Trends

4.3 Documentation and Information sharing

Instructions and information for using the Labs are stored in a DokuWiki (see Figure 5) hosted on the Lab storage server. This wiki contains an overview of the Labs, user instructions for the Lab environment and details of what data is available. It also allows users to add content such as project outcomes, code or hints and tips, and allows the Lab administrators to keep documentation about the Lab setup and what to do when there is a problem. The wiki has been made accessible to all ONS staff over the internet.


Figure 5 – Screenshot of Innovation Lab DokuWiki homepage

5. Conclusions

The ONS Innovation Labs are scalable environments where staff can experiment with new data, tools, technologies and methods safely without compromising ONS systems. They provide an environment where staff can develop a wide variety of statistical and technical skills and work on innovative projects that they put forward themselves. Evidence from the projects undertaken so far is that this is very motivational and allows staff to explore and learn about different areas of ‘Data Science’, thus equipping themselves and ONS for change.

The future of the Innovation Labs is not yet clear. Their distinguishing feature is that any open-source software (or software for which we have an appropriate license) can quickly and easily be installed. This means we can try out new software and technologies that we would otherwise have to go through lengthy and possibly costly processes to get onto ONS systems. It also allows us to easily download and explore public data sources which again we may not have access to on the ONS network. Therefore, for ONS to continue innovating it is important that support for the Labs continues and that they continue to be flexible.

If there was a need to have a higher BIL environment, for example to allow some of the data collected by ONS to be stored and analysed, accreditation would require the removal of such flexibility (as it opens up security risks). Therefore, to use data that is at a higher Business Impact Level in a flexible environment, the only option is to reduce the BIL through anonymisation or by creating synthetic data which replicates the patterns in the higher BIL data. The Labs can then be used to test the new technologies/methods using the protected data - this process then provides the information necessary to make the case for using the new tools on a more secure platform with sensitive data in a 'production' ONS environment. The strategy therefore is to consider ways of making the Labs more robust to support ongoing use, but without losing their flexibility - and to incorporate the learning from these into business cases for new tools on production ONS systems.

As noted in the paper, the Labs are currently restricted to usage by Methodology and IT staff only. However, the plan is to open up the Labs to the rest of ONS in early 2015. Some pilot projects run by outside business areas have demonstrated how this will work. The likely demand is unknown, but some areas have expressed an interest in using the Labs to encourage innovation within the production of their statistics.

References

Naylor, J., Swier, N. and Williams, S. (2014a) ONS Big Data Project – Progress report: Qtr 1 Jan to Mar 2014. Available at www.ons.gov.uk/ons/guide-method/development-programmes/the-ons-big-data-project/ons-big-data-progress-report.pdf

Naylor, J., Swier, N. and Williams, S. (2014b) ONS Big Data Project – Progress report: Qtr 2 April to June 2014. Available at www.ons.gov.uk/ons/guide-method/development-programmes/the-ons-big-data-project/ons-big-data-progress-report-q2.pdf

Office for National Statistics (2013) ONS Strategy 2013-2023. Available at www.ons.gov.uk/ons/dcp14298_323384.xml

Annex A - Hardware list for the Titchfield Lab

Fujitsu Esprimo E5720 desktop x 9 (2-core Pentium CPU, 4GB RAM, 160 GB HDD)

Fujitsu 22” monitors x9

Fujitsu Celsius M720 Workstation x 7 (8-core Xeon CPU, 64GB RAM, 128GB SSD boot drive; 4 x 2TB SATA HDDs in 3 of these workstations, 2 x 2TB SATA HDDs in 1 of these workstations)

Fujitsu Celsius M730 Workstation x 2 (10-core Xeon CPU, 128GB RAM, 128GB SSD boot drive)

Synology NAS Diskstation DS1513+ with 5 x 3TB SATA HDDs

Netgear DGN1000 router

1 x DLINK DGS1100-16 EasySmart Gigabit Switch

1 x DLINK DGS1100-24 EasySmart Gigabit Switch

1 x Samsung colour laser printer
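As a rough cross-check, the raw physical totals implied by the workstation and disk entries in this list can be tallied as follows. This is a sketch under stated assumptions: the quantities are transcribed from the list above, the `Workstation` class and variable names are introduced only for this example, and the paper does not say how the virtual CPU and storage figures quoted in section 3 were derived, so these raw totals are not expected to match them exactly.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Workstation:
    """One workstation line from Annex A."""
    model: str
    count: int
    cores: int
    ram_gb: int


# Transcribed from Annex A.
WORKSTATIONS = (
    Workstation("Celsius M720", count=7, cores=8, ram_gb=64),
    Workstation("Celsius M730", count=2, cores=10, ram_gb=128),
)

# (number of workstations, disks per workstation, TB per disk), from Annex A.
DATA_DISKS = ((3, 4, 2), (1, 2, 2))

# (number of disks, TB per disk) in the Synology NAS, from Annex A.
NAS_DISKS = (5, 3)


def total_cores(workstations: Sequence[Workstation] = WORKSTATIONS) -> int:
    """Physical CPU cores across all workstations."""
    return sum(w.count * w.cores for w in workstations)


def total_ram_gb(workstations: Sequence[Workstation] = WORKSTATIONS) -> int:
    """Physical RAM in GB across all workstations."""
    return sum(w.count * w.ram_gb for w in workstations)


def raw_data_disk_tb(disks: Sequence[Tuple[int, int, int]] = DATA_DISKS) -> int:
    """Raw capacity in TB of the SATA data disks fitted to the workstations."""
    return sum(machines * per_machine * tb for machines, per_machine, tb in disks)


if __name__ == "__main__":
    print("cores:", total_cores())               # 76
    print("RAM (GB):", total_ram_gb())            # 704
    print("data disks (TB):", raw_data_disk_tb()) # 28
    print("NAS raw (TB):", NAS_DISKS[0] * NAS_DISKS[1])  # 15
```

The 704GB of physical RAM is close to the figure of about 700GB of virtual RAM quoted in section 3. The other figures are raw totals before any replication, redundancy or virtualisation overhead is taken into account.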

Annex B - Openstack Environment Specification

The environment was built by Canonical consultants in March 2014 using the stable versions of software available at that time (MAAS v1.5, Juju v1.6 and the Havana Openstack release). The basic structure is that MAAS manages the virtual/physical machines which will host the Openstack services. Juju is the software orchestrator which places the required services onto the MAAS managed machines.


The hosts (physical machines) are therefore set up as follows (machines managed by MAAS are denoted by (M)):

1 x hypervisor machine running Ubuntu 12.04LTS 64 bit server. This hosts 12 Virtual Machines, as follows (all use Ubuntu 12.04LTS 64 bit server):

LDAP identity server used for desktop authorisation (using OpenLDAP)

MAAS server and Juju client

(M) Juju Bootstrap server

(M) Openstack Dashboard

(M) MySQL database backend

(M) RabbitMQ messaging service

(M) Glance image storage gateway

(M) Cinder block storage gateway

(M) Ceph-radosgw object storage gateway

(M) Nova-controller cloud controller service

(M) Keystone identity service

(M) Neutron networking service

3x Storage nodes (M), each of which has 4 x 3TB HDDs. These provide the Ceph block storage service.

5x Compute nodes (M) which provide the compute service.

The Synology NAS (which has 5x 3TB HDDs) provides shared Virtual Machine storage for the compute nodes, and permanent storage for users' data.
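The layout described in this annex can be written down as a small data structure, which makes it easy to count the machines involved. The sketch below is only an illustrative model in plain Python: it does not use MAAS or Juju, and the variable and function names are assumptions introduced for this example. The service names and counts are transcribed from the list above.

```python
from typing import Dict

# Virtual machines on the hypervisor; True means the VM is managed by MAAS (M).
HYPERVISOR_VMS: Dict[str, bool] = {
    "LDAP identity server": False,
    "MAAS server and Juju client": False,
    "Juju Bootstrap server": True,
    "Openstack Dashboard": True,
    "MySQL database backend": True,
    "RabbitMQ messaging service": True,
    "Glance image storage gateway": True,
    "Cinder block storage gateway": True,
    "Ceph-radosgw object storage gateway": True,
    "Nova-controller cloud controller service": True,
    "Keystone identity service": True,
    "Neutron networking service": True,
}

HYPERVISORS = 1
STORAGE_NODES = 3          # each MAAS managed
COMPUTE_NODES = 5          # each MAAS managed
DISKS_PER_STORAGE_NODE = 4
TB_PER_STORAGE_DISK = 3    # as stated in this annex


def physical_hosts() -> int:
    """Physical machines in the Openstack environment."""
    return HYPERVISORS + STORAGE_NODES + COMPUTE_NODES


def maas_managed_machines() -> int:
    """Virtual and physical machines that MAAS manages."""
    managed_vms = sum(1 for managed in HYPERVISOR_VMS.values() if managed)
    return managed_vms + STORAGE_NODES + COMPUTE_NODES


def raw_ceph_capacity_tb() -> int:
    """Raw (pre-replication) disk capacity of the storage nodes in TB."""
    return STORAGE_NODES * DISKS_PER_STORAGE_NODE * TB_PER_STORAGE_DISK


if __name__ == "__main__":
    print("VMs on hypervisor:", len(HYPERVISOR_VMS))        # 12
    print("physical hosts:", physical_hosts())               # 9
    print("MAAS managed machines:", maas_managed_machines()) # 18
    print("raw Ceph capacity (TB):", raw_ceph_capacity_tb()) # 36
```

The nine physical hosts counted here match the nine Celsius workstations listed in Annex A (seven M720 and two M730).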