PESGM2014-000883 Resilient Core Networks for Energy Distribution

1

Resilient Core Networks for Energy DistributionNicolai Kuntze, Carsten Rudolph, Sally Leivesley, David Manz, Barbara Endicott-Popovsky

Abstract—Substations and their control are crucial for theavailability of electricity in today’s energy distribution. Ad-vanced energy grids with Distributed Energy Resources requirehigher complexity in substations, distributed functionality andcommunication between devices inside substations and betweensubstations. Also, substations include more and more intelligentdevices and ICT based systems. All these devices are connectedto other systems by different types of communication links or aresituated in uncontrolled environments. Therefore, the risk of ICTbased attacks on energy grids is growing. Consequently, securitymeasures to counter these risks need to be an intrinsic part ofenergy grids.

This paper introduces the concept of a Resilient Core Networkto interconnected substations. This core network provides essen-tial security features, enables fast detection of attacks and allowsfor a distributed and autonomous mitigation of ICT based risks.

I. INTRODUCTION

Increasing numbers of DER (distributed energyresources) systems are being interconnected to elec-tric power systems throughout the world. As DERtechnology evolves and as the impact of dispersedgeneration on distribution power systems becomesa growing challenge – and target of opportunity –nations worldwide are recognizing the economic,social, and environmental benefits of integrating DERtechnology within their electric infrastructure. (IEC61850-7-420)

DER control equipment is a vital part of modern and futurenetworks. Integrated phase synchronization, grid stabilizationand other functionalities provide a distributed energy gridcontrol, replacing centralized control in the direct control ofthe utilities. Communication networks and intelligent systemsin substations are at the core of these developments. Thereforefactors that would contribute to design of resilient core networksinclude an understanding of: strategic threats; principles forputting in place “work-around” solutions in a crisis; andstandards to manage changes in the environment of a crisis.As already stated in the IEC standard IEC 61850-7 [1],DER depend on communication to other entities in the grid.Information and Communication Technology (ICT) can providethis connectivity; however, it comes at the price of highervulnerability. Such a Smart Grid inherits all security issues andsecurity requirements that exist in other interconnected systemsand infrastructures. One major challenge is the reliability ofthe communication and the security of equipment used.

N. Kuntze and C. Rudolph are with Fraunhofer SIT, Germany{kuntze|rudolphc}@sit.fraunhofer.de

Dr S. Leivesley is with Newrisk Limited, UK [email protected]. Manz is with Pacific Northwest National Laboratory, USA

[email protected]. Endicott-Popovsky is with University of Washington, [email protected]

Equipment and communication links used for substationnetworks are prone to similar types of attacks as known fromenterprise networks or industrial control systems; however,the impact caused by an ICT based attack to substationnetwork equipment can be significantly higher than in corporatenetworks or ICS in production. In particular, the scope ofan attack can get very large if complete energy grids areaffected. Interestingly, security for substation networks isnot covered in the IEC 61850 but deferred to another IECstandard (IEC 62351-7). While IEC 62351-3 covers TCP/IPtechnologies, 62351-6 covers security profiles, and IEC 62351-7 focuses on power systems management and associatedinformation exchange, none of these sufficiently address thesecurity concerns raised in this paper. It provides guidance onthe protection of communication and network management.Resilience of interconnected substations requires a targetedapproach that considers the particular requirements of advancedsubstations.

II. SECURITY REQUIREMENTS

Substations in future energy grids need to support varioustypes of communication between substations, between bayswithin one substation and between devices located in differentsubstations. IEC 62351-7-1 states:

Many functions are implemented in intelligent elec-tronic devices (IED). Several functions may beimplemented in a single IED or one function may beimplemented in one IED and another function maybe hosted by another IED. IEDs (i.e., the functionsresiding in IEDs) communicate with functions inother IEDs by the information exchange mechanismsof this standard. Therefore, functions distributed overmore than one IED also may be implemented.

This complexity of possible interactions induces various se-curity requirements that need to be satisfied in order to providehigh reliability and resilience. First, substations themselves (e.g.routers serving as the gateway from outside) need a secureidentity in the network to establish trust. Traffic to and fromthe substation needs to be authenticated. Then, IEDs withinthe substation (e.g. clustered within one bay or single devices)also need to be identified and authenticated. Then, there needto be mechanisms to determine and securely report the healthstatus of the complete substations and/or critical devices withinthe substations. This reporting can be towards remote controlcenters, integrated with a SIEM (Security Information andEvent Management) or direct via the HMI (Human MachineInterface) within the substation.

Resilience also induces the need for highly available commu-nication. Thus, redundant communication , separate networksfor operation, control and reporting, emergency procedures to

978-1-4799-6415-4/14/$31.00 ©2014 IEEE

2

restore communication even when direct links to control centersare no longer available, and redundant physical transport mediamight be necessary for very critical environments.

Resiliency requires a system to maintain basic functionalityeven in the case of cyber-attacks on parts of the grid. Thus, onestrong requirement is to contain attacks and prevent infectionof bigger parts of the system. One underlying requirementis, that there needs to be a quick and decentralized detectionof attacks, i.e. of changes in the health status of intelligentelectronic devices, network components, etc. This informationneeds to consider software and configuration of the devices.Furthermore, reliable audit trails need to be generated for allICT components. these audit trails shall represent processeswithin bays, substations or a network of substations. Theseprocesses can also include human interaction with the system.

Finally, any such security technology needs to be groundedin the fundamental theory of resilience.This will enableresearchers and operators to move from a patch-and-catch mind-frame to a more proactive and systemically resilient solution.Technology today is built without scientific foundations, thrownagainst the wall, and evaluated without theory of resilience orany relevant metrics. Technology like Resilient Core Networks,grounded in theory of resilience, helps shift the paradigm to ascientifically grounded engineering approach.

III. SYSTEM HEALTH MEASUREMENT AND REPORTING

ICT security research and development has created amultitude of approaches to ensure the correct operation andprotection of assets in the area of enterprise/office use cases.Obviously, these cannot be directly applied to energy grids(All too often ICT solutions emphasize confidentiality atthe expense of availability; however, for the development ofsecurity architectures in critical infrastructures, it is importantto revisit these technologies and to reuse and adapt conceptsfit for the purpose. Adaptation is necessary, because criticalinfrastructures have specific characteristics & requirements withrespect to user interaction, maintenance times and availability.

Particularly relevant is the status of an individual devicein the network. Malicious modifications to single devicesneed to be detected. Health reporting is the issue that isdiscussed in the following paragraphs. The approach uses stateof the art, commercially available, hardware-based securitytechnology. The discussion covers the protection of individualdevices, communicating the health status, establishment ofprotected channels between different types of componentswithin substations and security monitoring by collecting andevaluating information on relevant events.

A. System Attestation

In the design of secure devices tamper resistant and tamperevident need to be distinguished. Tamper resistance preventsattacks on a device. This can be achieved through rigoroustesting and hardening or strict physical separation. Tamperresistant design is expensive and not feasible for complexinterconnected systems. In contrast, tamper evidence is theproperty that malicious modifications can always be detected. In

tamper evident systems, manipulations might not be prevented,but do not remain undetected.

A standardised technology available to implement systemattestation is Trusted Computing [11]. This technology usesa special hardware trust anchor, the Trusted Platform Module(TPM) [4], to store information on the current status of thesystem. Further, this information can be reliably reported toother entities and compared with reference values [6] (thusenabling remote attestation). A TPM-enhanced system istherefore able to provide proof on its status at any given time.

Tamper evidence using a TPM is built by introducing ameasurement chain in the design of the device. All softwareis measured before execution and this measurement is storedas hash values in the TPM. Using a TPM-generated signatureof the hash values representing the current status, an externalverifier can check the status of the device. Several protocolsfor integrating this process into infrastructures are available.One example is Trusted Network Connect (TNC) [14].

B. Channel protectionSubstations contain various types of devices. Encryption of

communication links using standard protocols such as TLSor IPsec is not always suitable due to device limitations.Other targeted security mechanisms can be available andcan be combined to achieve some level of security [12].For example, the US DOE developed a Secure SCADACommunications Protocol, bump-in-the-wire solution for justsuch legacy systems [3], [2]. Also, encryption is not alwaysrequired as shown in section II. More importantly, the integrityof the channel needs to be monitored to prevent maliciouspacket modifications or packet injection. Efficient and fastmechanisms to detect (not prevent) changes to communicationchannels exist (e.g. based on chains of hashes) and canbe deployed in channels with securely identified endpointsor communicated via independent communication links orindependent logical channels.

Individual transport links for channel protection have theadvantage over encrypting traffic itself in that the transfer inthe actual communication network is not impacted. Packetsfor the detection of modifications can be transferred later notimpacting the core functionality of the network.

C. Reporting on the level of the infrastructureRouters, switches and other components building the commu-

nication infrastructure can report events visible on the network.Also other IEDs can report events. Such event informationcan provide valuable information on the current status ofthe communication network and on the energy grid itself;however, analysis of events reported by the infrastructure is verydifficult without considering the context of previous operationsor the current functional processes. Correlation of events isthe key to a better understanding and interpretation of eventsto detect malicious operations in the network. Metadata-basedapproaches, e.g. using IF-MAP where a metadata graph canrepresent the current status of a network, can close the gapon event correlation. Additionally, SIEM-based solutions canpotentially fuse information from several disparate sources(ICT and ICS) to provide a holistic picture of security events.

3

IV. RESILIENT CORE NETWORK

IT-based attacks on Industry Control Systems require someaccess to intelligent devices via communication channels. Thus,the communication layer is one key element for resilience andneeds to form a protected core system that is difficult to attack,that can contain attacks to small parts of the network andmaintains core functionality even under attack. The followingfeatures are required to form an architecture able to preventattacks and retain core functionality also in the case of attacks.

Trust in the operation of individual devices is based onhealth monitored components. Remote health monitoringenables the detection of changes to the system which thencan be classified and reacted on. This supports the detectionand mitigation of even previously unknown attacks (in contrastto virus protection). A reliable networking functionality buildsthe basis of the core system. Each communication componentis health-monitored by components in the neighbourhood.This allows a network-based resilience to form against ICTattacks which also provides the ability to protect attached con-trol systems without sufficient inherent security functionality.Continuous authenticated information on the status of theinfrastructure allows operators to establish a master display forsecurity and to orchestrate mitigation actions. This is extendedby meta-data information on the core network itself and ondevices attached to the network. Depending on the availableinformation, this monitoring can be used to check security rulesfor the complete system of systems. ICS aware mitigationof detected risks with respect to the system status and typeof risk detected (malicious, accidental, on purpose, ...) can beachieved by running the risks identified against continuouslyupdated scenarios. This assists operators to manage the issueand to pre-emptively expand resources to meet more dangerousthreats and to engage wider threat response actors. The ICSaware mitigation has unexplained anomalies as an outcomealongside probabilities of identified sources and it is a forensictool to be built into the system.

A. Resilient core for individual substations

This sections introduce the concept of a resilient corenetwork. The concept relies on already available technologiesthat can be deployed in the area of substation networks, but alsoin other related environments. This paper does not cover initialprocesses in the establishment of trust into an individual deviceor set-up procedures. These processes are covered in previouswork. Examples include a lightweight configuration and trustestablishment [8], [5] and the required trust architecture [7].

Systems for power utility automation such as protectiondevices, breakers, transformers or substation hosts form localinter-networked entities with interfaces to HMI, engineering,central control and management, as well as to other substations.A substation itself features a Substation Automation System(SAS) which connects Intelligent Electronic Devices (IED)within the substation. Communication from a high level analysiscan therefore be distinguished in (i) local SAS communicationbetween IEDs, (ii) inter-substation communication and (iii)substation to command interaction.

The information exchanged in all three communicationscenarios is crucial to the operation of the substation andthus for the operation of the energy distribution. Fast responseto ICT-based attacks on components require distributed andautonomous detection and reaction as discussed in section II.Each component individually and separately needs to be ableto assess the health of communication partners without helpof any central entity. Further, flexible response needs to enableappropriate mitigation techniques to isolate and reduce theimpact of ICT attacks on different levels.

The core part in this concept is the detection of maliciousmodifications to components using system health measurementand reporting. This reporting can be instantiated on all threeindividual communication relations as shown in IEC 61850-7.On the level of individual substations, the SAS is the centralnetwork all components are attached to. Therefore, the SAS isalso the central element to establish core security functionalitiesand to enforce technical security policies within a substation.

Continuous health checks between SAS components establisha resilient SAS. Components connected (esp. IEDs) to theSAS are checked before access is granted to SAS resources.Connected components are continuously checked at runtime.Security checks to IEDs can be performed using direct healthchecks if the IEDs in question is equipped with suitablesecurity technology like the introduced Trusted Computing.Indirect checks, like package analysis or port scans, allow forsecurity assessment of legacy equipment. It is important tonotice that, based on health checks, it is possible to support adegraded operational status (see section IV-E). Thus, networks(and essential parts of substations) can stay operational evenunder ICT-based attack or during recovery from attacks. Thisenables a resilient network to fight through an attack. EnergyDistribution systems can continue to support critical functions.

To enable external entities (e.g. other substations or controlcenters) to verify the security status of all intelligent compo-nents of a complete substation it is required to analyze thestatus of the SAS and the respective relevant bays involved inthe communication. First, one would check the router acting asthe gateway to the substation. Then devices in different baysare checked. These bays can be considered independent units inthe substation. Thus, only the bay addressed in communicationneeds to be verified. Each bay can be again contain a set of IEDswith dependencies, inducing a particular order of health checks.And so on down within each bay to the individual components(i.e. IEDs). All this information can also be collected by theSAS or the router. Combined reports can then provide concisehealth information on the substation and on individual bays.

B. Resilience and health monitoring for the network

Individual security checks and monitoring of substationsprovides information on separate parts of the grid. Combiningthis information in a hierarchic and distributed security moni-toring concept establishes a resilient core network for a largenetwork of substations. The technology for this distributedhealth monitoring relies on establishing the technology in thedifferent intelligent devices, most important is the ability totestify the status of a device to a remote verifier. The integration

4

of security functions allowing for these security reports isshown in [13] and a practical implementation was shown in[9]. Distributed health monitoring requires also reference valueson the side of the verifier. Creating reference values is complexin the area of general purpose ICT systems as used in privatelyowned devices. Special purpose devices typically have a wellknown and understood software outfit.

C. Security management and visualization

The resilient core network described above provides a largeset of security-relevant information. A high level of reliabilityis achieved by using hardware-based security to protect thisinformation; however, evaluation, correlation, visualization andinterpretation of this large amount of information for humansis a non-trivial task. Reactions to attacks and reported changesin the status of the network need to be induced and controlledby operating personnel. Therefore, support tools displayingthis information are essential. This display needs to showinformation on suitable abstraction levels to enable humanunderstanding of the correlated health status; however, it alwaysmust be possible to trace back events to the individual sources.This linking is necessary to be able to define and executetargeted reactions.

The ability to fuse information from multiple disparateinformation source – router and switch logs, firewall andIDS alerts, host events logs, etc. – is often a complex,manual and arduous task. Automated and semi-automatedsolutions exist in Security Information and Event Management(SIEM) tools [10]. While these are ICT products, there isconsiderable interest in being able to merge traditional IT andcommunications information with ICS telemetry. Finally, web-based visualization and analysis approaches can be foreign totraditional ICS operators and careful introduction and parallelutilization will be required. Using modern information andevent management tools will enable greater situational aware-ness for control system operators, cyber security defenders,and critical infrastructure as a whole.

D. Meta data information

The collected information and meta-data provides detailedinformation on a grid as a system of systems. This is then usedto continuously run and evaluate strategic threat scenarios forcritical function failure points in each system. This integratesthe system view and gives an output of a “bad health” status.Games theory operational research or other risk evaluationsolutions can then be performed on the integrated systemreport. This changes the predictability of systems recoverywhich otherwise is vulnerable to sustained attacks. In the eventof natural hazards or accidental loss events, such a monitoredsystem of systems gives operators a greater range of solutionsfor sustaining the service.

The approach can be extended by a resilience simulationmodel, unconnected to the system, that tests emerging threatsand risks and also unpredicted strengths in the expandingnetwork. This would feed changes into the system of systemsassumptions against which the strategic threat scenarios are run.Such a view can include systems anomalies from all causes

or unsolved reports and will build a greater strength againstattacks masked as anomalies on a system. Natural hazards suchas solar storms and rare consequences of nuclear conflict suchas EMP and IEMI (Intentional Electromagnetic Interference)which is an emerging threat, may at present, be unmappedon DER national resilience studies. A resilience simulationmodel will provide new data on vulnerabilities. The proposedfocus is not exclusive to vulnerabilities and is designed toidentify unexpected risk reduction arising from factors such asreductions in distance or perhaps growth of multiple pathwaysthat replace previously critical vulnerable points in the system.

A test bed for human factors failure modes across the systemwould promote strengthening through compensatory design ofphysical hardware and/or systems reporting. Human factors maydirectly cause accidental loss events with serious consequencesand may be the cause of failure to recognize multiple factorsthat are causing an anomaly.

E. Reactions and mitigation

As the availability of the network is of utmost importance,reaction to security events needs to minimize the impact onthe network operation. From a high level view, the followingfour categories of health status can be distinguished. Compliantto the expected behaviour. In case of assessment on the basisof TPM- enforced remote attestation, the binary measurementsof all components and communication paths are verified andconsidered as trustworthy. Also other techniques for end pointassessment show no unexpected behaviour. In this case, correctbehaviour can be expected and additional connectivity can beallowed, e.g. for maintenance or for distribution of updates.Outdated software on the device which can be detectedeither by comparing to old measurements or known behaviourdetected by other means that matches an old and deprecatedbehaviour. This status might indicate vulnerable systems, butdoes not directly show an attack. The device can continue tooperate (maybe in restricted mode) until a suitable time forupdating the software occurs; Typed malicious software incase of already known and well understood changes to thesoftware on the device being checked. A malicious modifieddevice may still be providing the intended service and couldunder special circumstances still be used in the network. Thesecircumstances are threat specific protection measures enforcedthrough the infrastructure the device is connected to. This caseallows for targeted reactions to the known changes. This wouldaffect the trust calculations for risk in simulation or metadata-based approaches like IF-MAP. Previously unknown status orbehaviour of a device. In this case, the focus is to quarantinethe device in order to prevent the spread of a possible maliciousinfection. It is important to understand that even in quarantinethe functionality of a manipulated device can still be provide;however, device interactions need to be strictly monitored untilhealth of the device can be restored.

A reaction to a security incident needs to focus on providingmaximum availability and therefore tailors the mitigationaccordingly, preventing a spread. A spread of the attack couldcause even larger periods of reduced availability. A mitigationof risks deriving from incidents as introduced requires first an

5

appropriate quarantine of the device in question. The levelof quarantine can be from total disconnect of the device even bypowering down the physical port to a specific access restrictiondue to a specified access restriction aware of the specific attack.Within a quarantine environment the network can also providespecific services to establish the compliant behaviour of thedevice. These service involve update procedures, but also anindependent service network that allows for a remote accessof an administrator for a deeper analysis of the device. Anunknown behaviour should also trigger an investigative processtrying to identify the specific nature of the incident and itspotential impact on the service.

After isolation, it is important to restore a clean functionalityof the malicious device. The architecture needs to feature arestore without a reboot of the modified device to a cleansystem status or provide for other seamless means to providefor service restoration. One well known way is to use sparedevices or backup systems. Thus, it has to be kept in mindthat spare devices are prone to attacks as well.

Managing the availability of the control systems and com-munication links is one of the critical components of systemssustainability. A risk assessment of this critical componentwould state that terrorism and state-sponsored conflict aregrowing in 2013 and asymmetric attacks are a continuousthreat. The control system is the logical target for ease ofentry into Critical National Infrastructure (CNI). These attackscan have cascading consequences across an energy distributionsystem and remove a nation’s capacity to recognize an imminentthreat and to deploy resources to respond to the threat or to beable to mitigate consequences to large population numbers ifenergy is unavailable. This type of attack can weaken publicperception of the government’s security capability and an attackon energy distribution in fragile states is a significant risk togovernment stability for this reason. Smart cities are particularlyvulnerable to failures with high impact because of the level ofconnectedness and integration.

V. CONCLUSIONS

A resilient core network is a strategic solution basedon a platform that is designed to manage an expandingrange of failure modes within energy systems. The designcomponents suggested in this paper recognize that there is arapid expansion of factors that may affect energy availability.The conceptual discussion suggests that a resilient core networkdesign may be dynamic and pre-emptive in its operations, andbased upon foundational cyber science and resilience theory.This would replace the traditional mode of security whichrelies on response to attack consequences and dependence onretrospective diagnostics once systems are infected or affectedby unexpected hazards in the environment.

A strategic approach is needed when there is such a rapidexpansion of sources of energy and a drive to much greaterconnectedness between countries. The threat profile for energyinfrastructure has expanded due to terrorism, proxy conflictsbetween states. ICS attacks have been used in asymmetric proxystate conflict and human factor failures are naturally presentin rapidly changing systems as designers and operators fail

to recognize emerging risks from the expanded and changedsystems.

Energy systems require catastrophic failure analysis basedon the ongoing rapid expansion of energy modes. This paperhas explored some of the security requirements, includingreporting and novel designs for resilience of networks underattack. The discussion on key concepts of a resilient corenetwork has established five key processes which may beworked through to proof of concept for new designs to protectCNI. The availability of ICS has been identified as one ofthe critical inputs to energy sustainability and four categoriesof health status based on critical failure modes have beendescribed. Of special importance is the ability within a systemto mitigate by preventing a spread of an attack and this hasbeen approached by quarantine. Novel design concepts havealso been discussed based on changing the way in which ICSmay be air-gapped pre-emptively in an energy distributionsystem. Proposals have also been made for new developmentof system of systems and for a resilience simulation model andtest bed. These two components will strengthen the ultimatesustainability of energy distribution at national and local levels.

The resilience core network concept requires further devel-opment to move to proof of concept and for testing acrossa selection of networks in several countries to develop novelsolutions for the highly vulnerable interfaces of energy supplybetween nations across land and sea and to strengthen a nation’senergy distribution sustainability against a range of threats.

REFERENCES

[1] IEC SC 57. Communication networks and systems for power utility au-tomation. Technical Report IEC 61850, The International ElectrotechnicalCommission, 2011.

[2] M.D. Hadley and K.A. Huston. Secure scada communication protocolperformance test results. U.S. Department of Energy, 2007.

[3] Hank Kenchington and Rhett Smith. Doe national scada test bed (nstb).U.S. Department of Energy, 2008.

[4] Steven L Kinney. Trusted platform module basics: using TPM inembedded systems. Newnes, 2006.

[5] N. Kuntze and C. Rudolph. Secure deployment of smartgrid equipment.In Power and Energy Society General Meeting, 2013 IEEE, 2013.

[6] N. Kuntze, C. Rudolph, I. Bente, J. Vieweg, and J. von Helden.Interoperable device identification in smart-grid environments. In Powerand Energy Society General Meeting, 2011 IEEE, july 2011.

[7] N. Kuntze, C. Rudolph, M. Cupelli, J. Liu, and A. Monti. Trustinfrastructures for future energy networks. In Power and Energy SocietyGeneral Meeting, 2010.

[8] Nicolai Kuntze and Carsten Rudolph. On the automatic establishmentof security relations for devices. In Proceedings of the IFIP/IEEEInternational Symposium on Integrated Network Management, 2013.

[9] Pedro Larbig, Nicolai Kuntze, Carsten Rudolph, and Andreas Fuchs. Onthe integration of harware-based trust in embedded devices. Konferenzfur ARM-Systementwicklung, 2013.

[10] David Miller and Brock Pearson. Security information and eventmanagement (SIEM) implementation. McGraw-Hill, 2011.

[11] Chris Mitchell. Trusted computing. IET, 2005.[12] Ludovic Pietre-Cambacedes and Pascal Sitbon. An analysis of two new

directions in control system perimeter security. In Proceedings of the3rd SCADA Security Scientific Symposium (S4), pages 4–1, 2009.

[13] Martin Schramm, Karl Leidl, Andreas Grzemba, and Nicolai Kuntze.Poster: Enhanced embedded device security by combining hardware-based trust mechanisms. In Proceedings of the 2013 ACM SIGSACconference on Computer & communications security, CCS ’13,pages 1395–1398, New York, NY, USA, 2013. ACM.

[14] Josef von Helden, Ingo Bente, Jorg Vieweg, and Bastian Hellmann.Trusted network connect (tnc). European Trusted Infrastructure SummerSchool, page 54, 2009.

Documents

PESGM2014-000883 Resilient Core Networks for Energy Distribution