Defending industrial control systems: an end-to-end approach for managing cyber-physical risk
(USC Thesis Other)
DEFENDING INDUSTRIAL CONTROL SYSTEMS: AN
END-TO-END APPROACH FOR MANAGING CYBER-
PHYSICAL RISK
By
Yatin Wadhawan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2019
Copyright 2019 Yatin Wadhawan
Defense Committee
Dr. Clifford Neuman (Committee chair), Computer Science, USC
Dr. William G.J. Halfond, Computer Science, USC
Dr. Viktor Prasanna (External Faculty), Electrical Engineering, USC
Dedication
To my beloved grandfather, parents and 187 …
Acknowledgment
I would like to thank my advisor and the chair of my defense committee, Professor
Clifford Neuman, for allowing me to work under his supervision and supporting me throughout
my Ph.D. journey. I would also like to thank Professor Neuman for giving me the opportunity
to work as a research assistant at the Information Sciences Institute on the risk assessment of the
Los Angeles Department of Water and Power Smart Grid Regional Demonstration Project
(SGRDP), and the Northrop Grumman Cyber Security Consortium. Besides my advisor, I would
like to thank the members of my defense committee: Professor Viktor Prasanna and Professor William
G.J. Halfond. In addition to the defense committee members, I would like to thank Professor Jelena
Mirkovic, Professor Milind Tambe, and Professor Mohammad Naveed for serving on my
qualification exam committee.
My sincere thanks also go to Professor Anas AlMajali for his useful discussions,
comments, and contribution to my research. I would also like to thank the Department of
Computer Science for supporting me throughout this beautiful journey. I especially thank my
colleagues who are in different areas of research, Ramesh Manuvinakurike, Vinod Sharma, and
Debarun Kar, for the lengthy and detailed discussions and comments on my work. I would like
to acknowledge Alba Regalado-Palacios for her administrative and inspirational support at ISI.
Finally, I am sincerely grateful to God, my beloved parents (Rakesh and Rekha
Wadhawan), my cousin (Priyanka and her husband Rishabh Kapoor), my family members and
friends for their constant support. Your constant encouragement, resilience, patience, and belief
in me made this possible.
Table of Contents
Dedication
Acknowledgment
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
1.1 Scope
1.2 Resilience vs Risk
1.3 Problem Statement
1.3.1 Smart Grid Resilience
1.3.2 Distinct Functional Components
1.3.3 Agents
1.3.4 Resource Allocation
1.4 High-level Approach
1.4.1 Risk Assessment
1.4.2 Risk Mitigation
1.5 Thesis Contributions and Outline
Chapter 2 Related Work
2.1 Risk Assessment
2.2 Resource Allocation
2.3 Related Work Analysis
Chapter 3 Smart Grid Resilience: Attack Defense Approach
3.1 Objective
3.2 Worm Propagation Attack
3.3 Attack on Gas Pipeline Systems
3.3.1 Function-Based Methodology
3.3.2 Pressure Integrity Attack
3.4 Denial of Service Attack
3.5 Analysis Methodology and Results
3.6 Usefulness to Power Engineers
Chapter 4 End-to-End Risk Assessment
4.1 Risk Assessment Methodology
4.2 Smart Grid Cyber Domain
4.2.1 Cyber Domain Architecture
4.2.2 Test Cyber Domain Architecture
4.3 Likelihood of Attack
4.3.1 Background
4.3.2 Is Bayesian Network efficient for Smart Grid?
4.3.3 Bayesian Attack Graph for Smart Grid (BAGS)
4.3.4 Probability to Compromise a Cyber Function
4.3.5 Tool Prototype and Simulation Results
4.4 Quantify Impact of Manipulating Circuit Breakers
4.4.1 Physical System Modeling
4.4.2 Quantify Impact and Simulation Analysis
4.5 Risk Determination
4.6 Usefulness to Power Engineers
Chapter 5 Risk Mitigation
5.1 Reduce the Likelihood of Attack in the Cyber Domain
5.1.1 Motivation
5.1.2 System Description
5.1.3 Reinforcement Learning
5.1.4 Experiment
5.1.5 Simulation Analysis
5.1.6 Conclusion
5.2 Reduce the Impact of an Attack in the Physical Domain
5.3 Reduce the Impact using Governor in the Cyber Domain
5.3.1 Base Concept
5.3.2 Attack via Demand Response Functionality
5.3.3 IGNORE
5.3.4 IGNORE for Demand Response
5.3.5 Experiment Demonstration
5.3.6 Analysis of Special Cases
5.3.7 Countermeasures
5.3.8 Limitations
5.3.9 Governor Protection
5.4 Conclusion
Chapter 6 Power Storage Protection Framework
6.1 Attacks on Power Storage
6.2 Power Storage Protection Framework
6.2.1 Agents
6.2.2 System State
6.2.3 Actions
6.2.4 State Transition and Observations
6.2.5 Payoffs
6.2.6 Assumptions
6.3 Experiment
6.3.1 POMDP Model
6.3.2 Solving POMDP Model
6.3.3 Simulation Analysis
6.4 Conclusion
Chapter 7 Defending Oil Pipeline from Cyber-Physical Attacks
7.1 Domain Description
7.1.1 Motivation of Players
7.1.2 Understanding of Cyber Targets
7.1.3 Understanding of Physical Targets
7.1.4 Challenges Faced by Players
7.2 Game Theoretic Approach
7.2.1 Stackelberg Security Game
7.2.2 Rewards
7.2.3 Computing Optimal Strategies
7.3 Theorems
7.4 Conclusion
Chapter 8 Case Studies
8.1 Historical Blackouts and Attacks
8.1.1 2003 Blackout in US and Canada
8.1.2 2012 India Blackout
8.2 Governor for Automated Car Systems
8.3 Application of Attack-Defense Approach
8.4 Application of End-to-End Risk Assessment Methodology
Chapter 9 Thesis Discussion
9.1 Thesis Discussion
9.2 Limitations
Chapter 10 Conclusion and Future Work
10.1 Summary
10.2 Contributions
10.3 Future Work
10.4 Concluding Remarks
References
List of Figures
1.1 End-to-End Risk Management: Unifying Concept.
1.2 Smart Grid Instability: Attack Graph.
2.1 High level overview of related work objectives and approaches.
3.1 Attack-Defense Approach.
3.2 Demonstration of the distribution of meters in the simulated area.
3.3 Oil and Gas Cyber-Physical System Attack Graph.
3.4 Attack Tree based on function-based methodology.
3.5 Gas Pipeline Model and Attack Scenario.
3.6 DDoS attack on gas pipeline system to reduce the response of the SCADA.
3.7 IEEE 9-bus power model.
3.8 Generation and Load: No Attack.
3.9 Area Frequency: No Attack.
3.10 Worm Propagation over WMN.
3.11 Generation and Load: Worm Propagation Attack.
3.12 Area Frequency after Worm Propagation attack.
3.13 System Admin response to Worm Propagation Attack.
3.14 Remote Terminal Units along gas distribution pipeline compromise.
3.15 Pressure Integrity Attack on gas distribution pipeline.
3.16 Generation and Load after pressure integrity attack on pipeline.
3.17 Area Frequency after pressure integrity attack on pipeline.
3.18 DDoS on WMN of gas distribution pipeline.
3.19 Reduced SCADA Response to DDoS attack.
4.1 Smart Grid Cyber domain.
4.2 Test Network.
4.3 Smart Grid Resilience Modeling.
4.4 Demand Response Internal Functions.
4.5 BAGS System Design.
4.6 Function Bayesian Network.
4.7 Network Bayesian Network.
4.8 Vulnerability Network.
4.9 Function (top), Network (middle) and Vulnerability (bottom) Bayesian Network.
4.10 Bayes.jar tool Function Nodes.
4.11 Probability distributions when probability of remote attacker to attack is 0.70.
4.12 Unconditional Probability Distributions when Billing Engine’s SQL Injection vulnerability is patched. The effect of such change is propagated to other components.
4.13 Unconditional Probability Distributions when a Remote Code Execution vulnerability is discovered in Billing Engine server. The effect of such change is propagated to other components.
4.14 The IEEE-39 Bus Model.
4.15 The frequency of the system on four buses when the attack happens. Attack performed by opening the breaker of generator on bus 30 and closing it after time T. Each figure demonstrates different duration during which the breaker remained open. The breaker is opened after 1 second of the simulation. No PV is used in this simulation.
4.16 The rotor angle of the system on four buses when the attack happens. Attack performed by opening the breaker of generator on bus 30 and closing it after time T. Each figure demonstrates different duration during which the breaker remained open. The breaker is opened after 1 second of the simulation. No PV is used in this simulation.
4.17 The frequency of the system on four buses when the attack happens. Attack performed by opening the breaker of generator on bus 33 and closing it after time T. Each figure demonstrates different duration during which the breaker remained open. The breaker is opened after 1 second of the simulation. No PV is used in this simulation.
4.18 The rotor angle of the system on four buses when the attack happens. Attack performed by opening the breaker of generator on bus 33 and closing it after time T. Each figure demonstrates different duration during which the breaker remained open. The breaker is opened after 1 second of the simulation. No PV is used in this simulation.
5.1 Q-Learning and SARSA learning Iteration. Agent (SA) performs an action on system state (s) represented by FBN. The system moves to a state (s’). The values s, a, s’ are given to the reward function that computes reward r’ and then to Q-value update function. Finally, agent observes s’ and maximum value of Q(s,a).
5.2 Each node represents a function and status (V: VULNERABLE, H: HACKED, U: UNKNOWN, P: PATCHED). Admin performs action SCAN-S2 and discovers that it is Hacked; the system moves to next state where S2 is Hacked and admin receives reward: function importance.
5.3 Agent chooses action PATCH-S2. The status of the function S2 changes to P and system moves to next state described in right side figure. Admin receives reward in terms of the importance of S2.
5.4 Q-Learning: Plot of the moving average of the 300 average rewards per episode for 100,000 trials with ε-Greedy policy for exploration and exploitation ε = 0.2, learning rate = 0.2 and discount factor = 0.2.
5.5 Q-Learning: Plot of the moving average of the 300 average rewards per episode for 100,000 trials with ε-Greedy policy for exploration and exploitation ε = 0.6, learning rate = 0.2 and discount factor = 0.2.
5.6 Q-Learning: Plot of the moving average of the 300 average rewards per episode for 100,000 trials with ε-Greedy policy for exploration and exploitation ε = 0.8, learning rate = 0.2 and discount factor = 0.2.
5.7 SARSA-Learning: Plot of the moving average of the 300 average rewards per episode for 300,000 trials, learning rate = 0.2, lambda = 0.4 and discount factor = 0.2.
5.8 The modified IEEE 39-bus power model. This model includes a PV system at bus 30 instead of the generator.
5.9 The frequency of the system on four buses when the attack happens. Attack performed by opening the breaker of generator. Each figure demonstrates different duration during which the breaker remained open. The breaker is opened after 1 second of the simulation. PV is used in this simulation at bus 30.
5.10 The rotor angle of the system on four buses when the attack happens. Attack performed by opening the breaker of generator on bus 30 and closing it after time T. Each figure demonstrates different duration during which the breaker remained open. The breaker is opened after 1 second of the simulation. PV is used in this simulation at bus 30.
5.11 An adversary compromises Demand Response Automation Server and a cyber system in SCADA system to manipulate power demand and supply.
5.12 Cyber Network for Demand Response. It shows where to deploy Governor to prevent attacks that originate from compromised power utility and DRAS servers.
5.13 Governor Design. IDS: Intrusion Detection System, RTDS: Real time power simulation tool.
5.14 Load Shedding Governor Rule flow chart.
5.15 IEEE 9-Bus System for Governor Experiment.
5.16 P(t): Maximum load that can be dropped. This is the result of sensitivity analysis.
5.17 Base Case - Generation Loss.
5.18 Base Case - Generation Loss and Load Shedding.
5.19 Load Shedding Attack - Within safe frequency limits.
5.20 Load Shedding Attack - Beyond safe frequency limits of 61.8 Hz.
5.21 Load Shedding Attack - With Governor.
6.1 System State.
6.2 State Transition.
8.1 Example of architecture of in-vehicle network. This figure is borrowed from [118].
List of Tables
2.1 Summary of the risk assessment related work approach, focus, and the cyber-physical attacks or systematic approach to understand what is defined in focus.
2.2 Summary of the resource allocation related work approach, focus, and the systematic approach to understand what is defined in focus.
4.1 Network Component Vulnerabilities.
4.2 Probability of Compromise using CVSS online Base Score.
4.3 Probability of compromising ECC for different levels of hackers.
4.4 Cyber-physical attack impact on the smart grid. Frequency deviation and impact are deduced from [104]. The third column represents our estimation of the quantitative impact.
4.5 Risk Values for Attacking CB connecting a generator at Bus 30.
4.6 Risk Values for Attacking CB connecting a generator at Bus 33.
5.1 Function state transition and reward function.
5.2 Attack Scenarios where governor might be useful.
5.3 Area wise attacks based on Contingency, Response and Attack (CRA) principle. Suppose X and Y represent two areas or zip-codes as in Figure 5.12.
5.4 Zip Code-90057 Los Angeles Neighbourhood model for smart meters and load distribution [127].
5.5 Approximate Number of Houses, Commercial and Industrial Customers in each zipcode that are responsible for load on buses in 9-Bus System.
7.1 Normal Form Game.
7.2 Rewards Table.
9.1 Contributions to Risk Assessment.
9.2 Contributions to Risk Assessment.
Abstract
A Cyber-Physical System is a new generation of system in which physical processes
are controlled and monitored from the cyber domain through advanced computation and
communication technologies, including humans in the loop. An activity performed in the cyber
domain can affect the physical infrastructure, or vice versa. The advancement of information
and communication technology and its integration with cyber-physical systems such as the
smart grid and oil and gas pipelines has led researchers to examine the resilience of these
systems. A key motivation for this work is the observation that the current literature does not
assess the resilience of such systems in the presence of multiple cyber-physical attacks, and that
there are no efficient ways to determine when to respond and how to allocate resources
effectively to maintain resilience. Most researchers assume that attacks on the grid’s
cyber functions have already happened and perform contingency analysis to evaluate
resilience. Current approaches demonstrate how resilient the grid is against attacks and
claim to assist power engineers in identifying factors that should be considered while
developing robust systems. In reality, they do not perform a complete risk assessment, nor do
they assist power engineers in deciding what actions to take in order to reduce risk.
In this dissertation, we present an end-to-end risk management methodology for
Industrial Control Systems. We demonstrate the effectiveness of the methodology on the smart
grid cyber-physical system. The first step of risk management is risk assessment, and the
second step is risk mitigation. In the risk assessment step, we first present an end-to-end risk
assessment methodology to quantify the risk of a cyber-physical attack. We describe how to
model the grid’s cyber domain to compute the Likelihood of Attack (LoA) using Bayesian
Networks and how to perform contingency analysis in the physical domain to quantify the
Impact of Attack (IoA). Finally, we combine the results of LoA and IoA to compute risk.
Second, we assess the resilience of the smart grid system in the presence of multiple
cyber-physical attacks on its distinct functional components using an Attack-Defense approach. We
enlarge the attack surface by considering oil and gas pipeline systems that supply fuel
to gas-operated peaker power plants. Through contingency simulations, we analyze multiple
cyber attacks that propagate from the cyber domain to the power system and discuss how such
attacks destabilize the underlying power grid.
We extend our risk analysis to assist power engineers in deciding how to reduce risk,
the second step of risk management. We compute the risk score of a cyber-physical attack using
two critical factors: LoA and IoA. The risk score represents how vulnerable the system is and
indicates which factor contributes more to the risk: the likelihood of an attack or its impact.
Once power engineers understand this computed risk, they can decide to reduce it
by reducing LoA, IoA, or both. We present various approaches to reduce risk in all the phases
of resilience: avoidance, containment, and recovery.
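As a rough illustration of how LoA and IoA might be combined into a risk score, the following Python sketch assumes that both factors are normalized to the range [0, 1] and that risk is their product. The product rule, the function names, and the example values are assumptions made for illustration only; the abstract does not state the combination rule.

```python
def risk_score(loa: float, ioa: float) -> float:
    """Combine Likelihood of Attack and Impact of Attack into one score.

    Assumption: both inputs are normalized to [0, 1] and the score is
    their product. This rule is illustrative, not taken from the source.
    """
    if not (0.0 <= loa <= 1.0 and 0.0 <= ioa <= 1.0):
        raise ValueError("loa and ioa must both lie in [0, 1]")
    return loa * ioa


def dominant_factor(loa: float, ioa: float) -> str:
    """Report which factor contributes more to the risk."""
    if loa > ioa:
        return "likelihood"
    if ioa > loa:
        return "impact"
    return "both"


if __name__ == "__main__":
    # Hypothetical values: an attack that is fairly likely but has moderate impact.
    loa, ioa = 0.7, 0.4
    print(f"risk = {risk_score(loa, ioa):.2f}, dominant factor: {dominant_factor(loa, ioa)}")
```

Under this assumption, a defender would target whichever factor `dominant_factor` reports first, which mirrors the choice between reducing LoA, IoA, or both.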
First, we present a reinforcement learning-based methodology to decide which system
to scan or patch in a given state of the smart grid cyber domain, with the aim of reducing LoA
and thus risk. This falls under the avoidance phase, where the goal is to prevent the system
from being compromised. Second, we demonstrate how replacing a generator with a
photovoltaic (PV) unit helps to mitigate the impact (thus mitigating risk) even when the attack is
successful, i.e., the containment phase. Third, we introduce the concept of the governor, a
component that serves to protect a cyber-physical system from attacks that are more severe and
frequent than is acceptable by enforcing security policies on the actions of a system’s
higher-level functions (such as Demand Response in the smart grid). An essential characteristic of the
governor is that it evaluates the requirement and safety property of commands issued by a cyber
system, which may or may not be compromised, and thus prevents unsafe commands from reaching
the physical domain (containment phase). Finally, using game theory, we present an approach to
model the interaction between a power engineer (defender) and an adversary to compute the
optimal policy for the defender to decide whether to decrease or increase power reserve,
perform load curtailment or shedding, etc., in order to minimize the cost of operation and
maintain power system stability. This approach falls under the recovery phase because the
optimal policy assists power engineers in recovering from cyber-physical attacks.
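The reinforcement learning step mentioned above can be sketched as tabular Q-learning with an ε-greedy policy, the combination named in the captions of Figures 5.4 to 5.6. The state labels, the action list, the reward value, and the helper names below are hypothetical; this is a minimal sketch of the general technique, not the dissertation's implementation.

```python
import random
from collections import defaultdict

# Hypothetical action set, modeled on the SCAN/PATCH actions named in the figure captions.
ACTIONS = ["SCAN-S1", "PATCH-S1", "SCAN-S2", "PATCH-S2"]


def choose_action(q, state, actions, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


if __name__ == "__main__":
    rng = random.Random(0)
    q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
    # One hypothetical transition: patching S2 in state "s0" yields reward 1.0.
    q_update(q, "s0", "PATCH-S2", 1.0, "s1", ACTIONS, alpha=0.2, gamma=0.2)
    print(choose_action(q, "s0", ACTIONS, epsilon=0.2, rng=rng))
```

With `epsilon` set to 0 the agent always takes the highest-valued action; larger values trade exploitation for exploration, which is the parameter varied across the Q-learning plots listed in the figures.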
Moreover, based on game theory, we present a theoretical framework for providing an
efficient solution to the real-world oil stealing problem. The game models the interaction between
system engineers and cyber-physical attackers. The game has two different types of targets,
attacked by two distinct types of adversaries who have different motives and can coordinate to
maximize their rewards. The solution to this game assists oil pipeline engineers in allocating
cybersecurity controls to the cyber targets and assigning patrol teams to the pipeline regions
to prevent oil stealing.
The end-to-end risk management approach covers all three phases of resilience:
avoidance, containment, and recovery. The results of these approaches can be leveraged by
power engineers to build and refine security policies to maintain the resilience of the smart grid
system against disturbances caused by malicious and non-malicious threats. Power engineers must
derive secure policies to influence the system’s design and perform in-depth risk
management using the approach proposed in this dissertation. The end-to-end risk management
approach can be applied to other cyber-physical systems such as oil and gas systems, nuclear
power plants, and water treatment plants.
Chapter 1
Introduction
A nation’s prosperity depends heavily on the energy sector. The energy industry serves
hospitals, transportation, public networks, businesses, production of goods, and more.
According to the US Department of Homeland Security [1], the energy sector is divided into
three highly interdependent Industrial Control Systems (ICSs): Electricity, Petroleum, and
Natural gas. Natural gas fuels electric plants that generate power; conversely,
components of natural gas and petroleum plants require power for operation. A failure of an
operation in an ICS, whether due to malicious or non-malicious factors, has cascading effects on
other systems and different sectors of the economy. An ICS is a Cyber-Physical System (CPS), which
refers to a new generation of systems where physical processes are controlled and monitored
from the cyber domain through advanced computation and communication technologies,
including humans in the loop [2].
ICSs rely on Information and Communication Technology (ICT) to monitor and control
operations. The adoption of state-of-the-art ICT in ICS is enhancing their ability to make
efficient use of resources and making it smarter than ever before. For instance, deployment of
Advanced Metering Infrastructure (AMI) in the power grid. Smart meters gather the power
consumption and quality data at customer’s premises and send it to the grid’s control center
via AMI. Another example of ICT in the oil and gas system is the use of Remote Terminal Unit
(RTU). RTUs capture state information about the natural gas pipeline and send it to the utility
control system where it monitors pipeline leakage or damage. Although such implementations
show the benefit of using technology to enhance ICS operations, the growing interdependence
of ICS and ICT imposes extraordinary challenges on the security of such systems and gives rise
to Cyber-Physical Attacks (CPAs) [3]. A CPA is an activity performed in the cyber domain
that affects the physical infrastructure, or vice versa. For instance, if the cyber system that
controls and monitors the pressure of natural gas in a pipeline sends malicious commands to
change the pressure beyond secure limits, this may allow natural gas to leak or destroy the
walls of the pipeline.
Nowadays, attackers have shown the ability to perform advanced persistent attacks on
various ICSs. The Stuxnet malware attack [4] on industrial computer systems of Iran’s nuclear
program was responsible for causing substantial damage to the nuclear physical processes.
Recently, a cyber attack on Ukraine’s power grid [5] caused power substations to disconnect
for three hours, which affected approximately 80,000 customers and forced system operators
to switch to manual mode. A study conducted by Tripwire [6] states that 82% of the oil and
gas administrators reported that their organization was subject to cyber attacks in 2015, and 69%
said they were not sure whether their system could detect a cyber attack. Not just the smart grid
but the oil and gas systems are also affected by CPAs. Consider the cyber attack on the Turkish
oil pipeline [7] company in 2008. The Baku-Tbilisi-Ceyhan pipeline was the most secured
pipeline in the world, but cybercriminals injected malware into the control network and
caused damage to the utility via an explosion. The terrorist attacks on a natural gas plant in Algeria
[8], oil stealing in Mexico [9] and Nigeria [10], and a recent cyber attack on the Norwegian
Statoil [11] have raised the security fears of the energy utility companies.
According to Lloyd’s Emerging Risk Report of 2015 [12], a major cyber attack on the
U.S. electric grid could cause over $1 trillion in economic impact and $71.1 billion in insurance
claims. Not only cyber attacks but also system failures can affect the system’s
resilience, for instance, the 2015 Porter Ranch gas leak in Southern California [13]. A gas leak
over a period has the potential to create power outages due to the shortage of fuel (natural gas)
for the gas operated peaker plants that generate power during peak hours. These successful
attacks and failures clearly show how vulnerable ICSs are, and thus, it is essential to develop
methodologies in order to evaluate resilience and protect these systems from CPAs. To counter
the efforts of adversaries, who focus on multiple attack vectors and surfaces, we need to
consider the larger surface area for the risk assessment and mitigation by incorporating direct
and indirect attacks on distinct functional components of the system. Such a detailed
description of evaluating the resilience of CPSs is largely unaddressed by the current literature.
To fulfill this requirement, a comprehensive and repeatable methodology is needed, which can
be applied to all CPSs.
In this dissertation, we present the end-to-end risk management methodology to defend
industrial control systems from cyber-physical attacks. We demonstrate the effectiveness of
the methodology in the smart grid domain. This research helps power engineers and admins
of other CPSs understand how to perform a risk assessment to compute risk, what the
significance of that risk is, and finally, how to reduce risk in the cyber and physical domains.
Figure 1.1: End-to-End Risk Management: Unifying Concept.
1.1 Scope
The scope of this dissertation is to present a methodology to perform end-to-end risk
management of a CPS. Figure 1.1 shows the unifying concept that can be applied to any CPS.
In this dissertation, we choose the smart grid system to demonstrate the effectiveness of our
methodology. The smart grid can be divided into the cyber and physical domains. The cyber
domain is responsible for monitoring and controlling the physical processes remotely to
maintain the stability of the grid. It consists of functions such as electricity control center, meter
data management, customer information system, demand response, substations, etc. The
physical processes of the grid are power flow, automatic generation control, voltage regulation,
frequency regulation, etc. Based on the information gathered from various sensor units and
physical systems, cyber functions make decisions and issue commands in response to certain
contingencies.
Risk assessment is a process of discovering potential threats (malicious or non-
malicious) on the system and what could happen if a threat occurs. Through our research, we
show it is vital to consider both the cyber and physical domain in the risk assessment process
(the first step of risk management). To perform the end-to-end risk assessment, we model the
cyber domain in the form of a Bayesian Network to compute the likelihood of attacking a cyber
function, and then we perform transient stability analysis to quantify risk in the physical
domain in the presence of CPAs, which originate from the cyber function. We compute the risk of
a CPA using two critical factors: the likelihood and the impact of an attack. Risk represents how
vulnerable the system is and which factor contributes more to the risk: the likelihood
of an attack or its impact. Once power engineers understand risk, they make decisions to
reduce it in the following ways: 1) reducing the likelihood, 2) reducing the impact, and 3)
allocating resources efficiently.
Risk mitigation, the second step of risk management, is one of the challenging tasks for
system engineers. It is hard to generalize the methodology for reducing risk in CPSs because
they are federated systems, and they have various unique system parameters. Managing risk
requires the efficient allocation of resources, and it is different for different cyber-physical
functions. Nonetheless, we present multiple ways to reduce risk in the smart grid cyber and
physical domain.
The key contribution of the dissertation is to demonstrate how to perform complete risk
management of a CPS. Through this methodology, we show why it is essential to consider both the cyber
and physical domains in the risk analysis, how to fill the gap between risk assessment and risk
mitigation by understanding various system parameters, and finally, how to build models to reduce
risk. Before we present our thesis statement, it is paramount to describe how resilience and risk
management are related.
1.2 Resilience vs Risk Management
Resilience is broadly defined as the “ability to resist, absorb, recover from, or
successfully adapt to adversity or a change in conditions” [129]. There are three stages of
system resilience. Failure avoidance is the first stage of resilience in which system engineers
deploy methodologies to predict and avoid potential failures (or system compromise by an
adversary). Failure containment is the second stage of resilience in which the primary motive
is to contain the failure within an acceptable level and prevent it from propagating to other
parts of the system. Failure recovery is the last stage of resilience in which the system takes
actions to bring the performance of the services to their desired levels. In this dissertation, we
focus on all three stages. Towards the end of this section, we show how the contributions of
this dissertation focus on all three stages of system resilience.
Risk management is a methodology that helps engineers compute and compare
different risks in terms of their likelihood of occurrence and impact on the system. Based on
this assessment, they decide which countermeasures to apply to reduce (or mitigate) risks. The
concept of resilience is related to risk management in three different ways according to the
report [129]:
1. Resilience as a goal of risk management
2. Resilience as a part of risk management
3. Resilience as an alternative to risk management
In this dissertation, we treat resilience as a goal of risk management which is based on the
main idea that “even the best risk management cannot lead to full protection” [129]. Therefore, the
goal of risk management is to reduce the likelihood and the impact of risks so that the system can survive
in the presence of an incident, either malicious or non-malicious, and recover to an acceptable
level. To make a resilient system, the first step is to perform a risk assessment to compute risk, which
is a function of the likelihood of an event and its impact on the system. Risk assessment helps
engineers understand how an event occurs, what its probability of occurrence is, where and what
the impact on the system is, and which factors contribute more to the risk. Without a risk
assessment, it is impossible to understand risk and take action to improve resilience. By performing
risk assessment, we are focusing on the first two stages of resilience.
The avoidance stage of resilience must focus on reducing risk by reducing the likelihood of an
event, for instance, performing effective patch management in the cyber domain of a CPS to prevent
cyber components (from where CPAs are mounted on the physical system) from getting compromised.
During the containment stage, the system must focus on reducing the impact of an attack in the cyber
and physical domain so that the system can adapt to the changes. Although an attack has already been
performed on the physical system, we can reduce its impact by placing protective measures in the
system. For instance, a policy server can be placed in front of a system’s higher-level function, which may
be compromised, to decide whether the commands issued are required and safe for the system. Now the question
is how to recover from unforeseeable risks. Risk recovery is as crucial as avoidance and
containment.
During the recovery stage, system engineers must take specific protective actions to mitigate
risk at minimum operating cost. An event (or attack) has happened, and engineers receive certain
observations from the system. Based on the observations, they decide what actions to take to bring a
system’s functionality to the desired performance level. For instance, in the grid, if a generation loss
happens, power engineers issue commands to perform load curtailment or shedding (to reduce overall
power demand), which depends on the nature of the contingency. In order to improve a system’s
resilience, it is necessary to understand risk by performing an end-to-end risk assessment. In the next
section, we state the problem statement and show how the contributions of this thesis focus on the
three stages of resilience.
Figure 1.2: Smart Grid Instability: Attack Graph
1.3 Problem Statement
In this section, we first present the thesis statement and then discuss the questions that
should be addressed to support the thesis statement.
Thesis Statement:
Given the components and functions of a Smart Grid system, how does the
manipulation of distinct functional components affect smart grid resilience? How should we
allocate resources to maintain resilience?
To answer this thesis statement, we need to understand and answer these sub questions:
a. What is smart grid resilience?
b. What are the various smart grid components and functions? How are they
susceptible to cyber-physical threats? Who manipulates functions and how?
c. How do we perform the end-to-end risk assessment and quantify
the risk of a CPA in the smart grid system? What do we mean by the computed
risk? What are the factors contributing to the risk and how are they useful to
power engineers in reducing risk?
d. What are the approaches in the cyber and physical domains to mitigate risk?
In the following sub-sections, we provide a basic understanding of questions a, b, and
c. We discuss question d in section 1.4.
1.3.1 Smart Grid Resilience (SGR)
SGR is the ability of the smart grid system to avoid failure of its functions in the
presence of non-malicious and malicious activities and to recover from those failures to an
acceptable state without affecting the function delivery. The primary function of the grid is
power delivery, which is affected by a mismatch between power demand and supply. The significant
consequences of such a mismatch are that peaker plants must be instructed to generate
power, stored energy must be used at some cost, prices increase, load is shed or
curtailed, and partial or complete power outages occur in some parts of the grid, which frustrates
customers.
Figure 1.2 shows the attack graph for grid instability. An
adversary can perform a variety of attacks on system functions, such as a malware attack or a
Distributed Denial of Service (DDoS) attack on communication infrastructure, as shown
in the first column of Figure 1.2. Once those functions are compromised, there is a physical
impact that leads to partial or complete function failure and finally affects the grid’s
resilience. To illustrate with an example, suppose an adversary performs the pressure integrity
attack [30] on gas pipeline infrastructure, which leads to stoppage of the gas distribution
pipeline. Since gas is not delivered to a gas-operated peaker power plant, it cannot generate
power, thereby affecting power supply during peak hours and thus degrading SGR.
1.3.2 Distinct Functional Components
The functional components of the grid are classified into three categories [14]:
Generation, Transmission, and Distribution. Each category contains various functions that are
useful in maintaining the resilience of the system. In this sub-section, we discuss some of the
smart grid functions and how they are susceptible to cyber-physical threats.
Governor and Automatic Generation Control (AGC). The governor is the primary frequency
control mechanism used by the generation system to bring frequency to its nominal value after
a significant disturbance. A governor is present on each generator that senses the changes in
the frequency of the system and accordingly controls the steam valve. Since the
system nowadays uses digital controllers that are connected to the control center through the Modbus
protocol, governors are susceptible to cyber-physical threats. AGC operates a section of the grid and
controls governors that are present on multiple generators. It calculates the area control error
by computing the total load error and net interchange error between multiple grid areas, which
are called Balancing Authority Areas (BAAs). The AGC control center guides generators to
increase or decrease their output based on the area control error. Since it performs its operation
over the communication network, it is also susceptible to cyber-physical threats. For instance,
an adversary may perform a false data injection attack against AGC to prevent the frequency control
mechanism from bringing frequency back to the nominal state [15].
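The area control error computation and the effect of a false data injection can be sketched as follows. This is a minimal illustration that assumes the widely used tie-line bias form, ACE = (NI_actual - NI_scheduled) - 10 B (f_actual - f_scheduled); the dissertation does not state its formula at this point, and the function name, the bias value, and all numbers are hypothetical.

```python
def area_control_error(ni_actual_mw, ni_sched_mw, f_actual_hz, f_sched_hz, bias_mw_per_0p1hz):
    """Tie-line bias ACE in MW (assumed form, not taken from the dissertation).

    bias_mw_per_0p1hz is the frequency bias B in MW per 0.1 Hz, negative by convention.
    A negative ACE indicates under-generation in the balancing area.
    """
    interchange_error = ni_actual_mw - ni_sched_mw
    frequency_error = f_actual_hz - f_sched_hz
    return interchange_error - 10.0 * bias_mw_per_0p1hz * frequency_error


# Hypothetical area: exporting 480 MW instead of 500 MW while frequency sags to 59.95 Hz.
true_ace = area_control_error(480.0, 500.0, 59.95, 60.0, -50.0)

# False data injection: the adversary reports a nominal frequency to the AGC,
# so the computed error understates how much generation is missing.
spoofed_ace = area_control_error(480.0, 500.0, 60.0, 60.0, -50.0)

print(round(true_ace, 3), round(spoofed_ace, 3))
```

Under these assumed values, the spoofed measurement makes the AGC see a smaller deficit than the real one, which is one way an injected reading could keep frequency from returning to its nominal value.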
Distributed Energy Resource (DER). DERs are smaller power sources, such as
electricity storage and renewable energy (wind or solar Photo Voltaic [16]), that come under
the distribution category. They are present at smart buildings, customers’ homes (as solar panels),
and elsewhere. Whenever there is a need for extra power beyond the base load, DERs are used to
balance generation and load. Because of this, DER customers are “prosumers” who produce
and consume power. Using DERs, customers generate the electricity they need and also sell it
to the grid. This can require frequent communication between DERs and the power utility.
The power utility sends control signals to smart meters to dispatch power from DERs or to
operate as an independent operator. Since communication happens over the Wireless Mesh
Network (WMN), DERs are vulnerable to cyber-physical threats. For instance, an adversary
may install a worm that propagates over the network and compromises other meters. The
payload that the worm carries controls DERs maliciously and prevents them from dispatching power
during peak hours, thus causing a power demand and supply mismatch.
Advanced Metering Infrastructure (AMI). AMI is the backbone of the grid
distribution infrastructure, which facilitates the communication between the control center,
sensors, PLCs, RTUs, etc. It is one of the components that make the power grid a smart grid.
It is composed of electronic hardware and software that help system administrators to gather
information about the grid from remote locations and control its functionality. AMI enables
features such as Demand Response and dynamic pricing (time-of-use and real-time pricing). An
adversary can perform a variety of attacks on AMI, such as those in [17] [18] [19], and affect the
grid’s resilience.
Demand Response (DR). One of the grid technologies that work over AMI is DR. It
enables consumers to play a significant role in reducing and shifting their load profile during
peak hours. It is defined as the changes in electricity usage by endpoint customers from their
daily routine pattern in response to variations in the prices of electricity over a period [20].
The power utility sends signals about the prices of electricity to smart meters present at
customers’ premises. Based on prices and their consumption patterns, customers try to shift their
consumption to hours when rates are low. In other words, they store electricity when prices
are low and use it when prices are high. DR functionality is vulnerable to CPAs since signals
that are sent by the utility can be modified or blocked by malicious attackers. For instance, in
case of loss of generation, the utility center may send a remote disconnect command to remote
terminal units at customers’ premises with a motive to disconnect them from the grid in
response to increasing load and reduced generation. An adversary may perform the same set of
actions from the cyber domain to drop load instantly [21] or over a period [22] to push the frequency
beyond the under- or over-frequency protection threshold, thus destabilizing the power grid.
Gas Distribution Pipeline. The natural gas low-pressure distribution pipelines are
useful during peak hours to provide fuel to peaker plants to generate power. Any malicious [7]
or non-malicious [13] activity on the gas distribution pipeline affects the gas delivery resulting
in the loss of power generation when it is most required during peak hours.
1.3.3 Agents
Agents who are interested in the power grid are attackers, power engineers, and
customers. The primary motive of attackers is to reduce the resilience of the system by
performing malicious actions. They compromise confidentiality, integrity, and availability of
the various system components. Attackers are of two types based on the target they attack.
The Cyber Attacker (CA) compromises Cyber Targets (CT) in the cyber domain. The CA
performs a variety of cyber attacks, such as malware injection, data integrity attacks, DDoS,
unauthorized access, and system hijacking, with a motive to compromise different CTs and control
physical processes. The Physical Attacker (PA) attacks physical components such as
machinery and hardware. The goal of the PA is to damage physical machinery such as
pipelines or to steal oil from pipeline segments in order to sell it on the black market to raise
funds [8] [9]. Since attackers can perform sophisticated attacks on such ICSs (as
demonstrated in [5]), there is a need to understand how attacks and their impact propagate
among various components, and how to perform resilience assessment in the presence of
multiple CPAs. The question is: how to evaluate the resilience of the smart grid system in the
presence of sequenced cyber-physical attacks on its distinct functional components and to
consider defender actions? We discuss an Attack-Defense approach to answer this question in
Chapter 3.
The motive of power engineers is to protect the underlying system from CPAs and
improve its resilience. They allocate resources efficiently at various places, patrol
physical targets, monitor network traffic, and perform vulnerability assessment and penetration
testing of cyber components. In order to maintain resilience, the defender must allocate
resources in the cyber and physical domain after understanding the risk corresponding to a
CPA. The questions are: how do we perform a risk assessment that takes into account the likelihood
of attacking a cyber system and its impact on the physical system to quantify risk? How does the
computed risk assist power engineers in deciding what actions to take in which domain? We
discuss the high-level approach to answer these questions in sub-section 1.4.1.
Finally, we have customers who sometimes act as attackers when they try to manipulate
their consumption readings [23], modify messages, or steal electricity [19] from the grid. They
decide power demand in a given area, and their requirement changes dynamically. They play
a significant role in the demand response functionality.
1.3.4 Resource Allocation
An adversary allocates resources with the motive to compromise the system and affect
its resilience. She identifies vulnerabilities in the system components and exploits them
manually or with a variety of automated tools. Before attacking a system, she performs
information gathering and social engineering to learn about the target and understand how
resources are allocated by system engineers to protect the system.
Power engineers are responsible for allocating resources with a motive to make ICS
operations efficient and useful as well as secure. They make allocation decisions based on their
experience of the system, various types of attacks detected, the cost of deploying assets, the
criticality of the system components and functions, and system vulnerabilities. They must keep
sufficient resources so that the power demand is fulfilled and the cost of resources is
minimized. With the motive to protect cyber targets, they decide which cyber system to scan
or update in a particular state, and allocate resources such as firewalls, intrusion
detection systems, antivirus software, and vulnerability scanners. For physical targets, they assign
patrol teams and install cameras, RTUs, and sensors. Their primary task is to decide which
targets are critical and what resources to allocate in order to maximize the system’s resilience and
minimize cost. The question is: what are the approaches that assist power engineers in
taking all such actions in a particular state? We discuss the high-level approach to answer this
question in sub-section 1.4.2.
1.4 High-Level Approach
In this section, we discuss a high-level approach to perform the risk assessment of the
grid and how its results are useful to power engineers in deciding what actions to take to reduce
risk. This high-level approach reflects contributions made by this dissertation in answering
questions discussed in previous sub-sections.
1.4.1 Risk Assessment
In this sub-section, we discuss the key idea and support it with a simple example. Finally,
we present the advantages of the proposed approach.
Key Idea
Risk is defined as the expected impact of a potential threat to the system [25] [26].
Typically, the risk of a CPA is expressed as a function of the likelihood of an attack and its
impact. In order to perform a risk assessment, we need to compute these two factors. To
compute the risk of a CPA on the grid, we first need to understand and model the cyber domain
of the grid, from where the attack originates, to compute the likelihood of an attack. Then, we need
to perform transient stability analysis to quantify the impact of that attack on the physical
system. Once we have computed both factors, we compute risk.
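The combination of the two factors can be sketched as follows. Since risk is defined above as an expected impact, this sketch assumes the simplest such function, the product of likelihood and impact; the attack names and numeric values are hypothetical placeholders, not results from this dissertation.

```python
def risk(likelihood, impact):
    """Expected impact of an attack: likelihood (a probability) times impact.

    The product form is an assumption; the text only says risk is a function
    of the two factors.
    """
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be a probability in [0, 1]")
    return likelihood * impact


# Hypothetical CPAs with (likelihood, impact normalized to [0, 1]).
attacks = {
    "breaker_manipulation": (0.3, 0.9),
    "meter_tampering": (0.6, 0.2),
}

# Rank attacks by risk, highest first, to see where mitigation effort should go.
ranked = sorted(
    ((name, risk(p, i)) for name, (p, i) in attacks.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.2f}")
```

In this made-up case the less likely attack still carries the higher risk because its impact dominates, which is the kind of insight the two-factor view is meant to give.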
Simple Example
How do we compute the risk of manipulating circuit breakers in the smart grid? A circuit breaker
is a device that connects the power generators to the grid. First, we need to model the attack on
the cyber system of the grid from where circuit breakers are controlled. The Electricity Control
Center (ECC) controls circuit breakers in a region, and if it gets compromised by an adversary,
she can open or close the breaker of any generator for any duration. The question is how an adversary reaches
the ECC and what the likelihood of compromising the ECC is. We need to understand the cyber system
of the grid to figure out the different entry points from where an adversary can enter
the network and start compromising various components. She performs a vulnerability
assessment to discover weaknesses and exploits them to gain control over the system
components to move forward in the network. Once she reaches the ECC, she controls the circuit
breakers. The likelihood of compromising the ECC depends on the vulnerabilities of the ECC and
the likelihood of compromise of all those systems that are present in the path from an entry
point to the ECC. The likelihood of an attack is based on known vulnerabilities and does not
account for zero-days. The next step is to quantify the impact by performing a transient stability
analysis of manipulating circuit breakers in the physical system using the Power World
simulator. Based on the physical parameters of the grid, such as frequency, we quantify the
impact. Finally, we compute risk using both factors.
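The dependence of the ECC's compromise likelihood on the systems along the path can be sketched as follows. This is a simplified stand-in for the Bayesian Network model: it assumes each node is compromised with its own exploit probability once at least one upstream node is compromised, and it treats upstream nodes as independent. The topology, node names, and probabilities are hypothetical.

```python
from math import prod


def compromise_likelihood(parents, p_exploit, order):
    """Propagate compromise probabilities through an acyclic attack graph.

    parents:   node -> list of upstream nodes (empty or missing for entry points)
    p_exploit: node -> probability that its known vulnerabilities are exploited
               once the adversary can reach it
    order:     nodes in topological order (entry points first)
    """
    p = {}
    for node in order:
        upstream = parents.get(node, [])
        if upstream:
            # Reachable if at least one upstream node is compromised (noisy-OR).
            reach = 1.0 - prod(1.0 - p[u] for u in upstream)
        else:
            reach = 1.0  # entry point: directly exposed to the adversary
        p[node] = p_exploit[node] * reach
    return p


# Hypothetical network: two entry points lead to a historian, which leads to the ECC.
parents = {"historian": ["web_server", "vpn_gateway"], "ecc": ["historian"]}
p_exploit = {"web_server": 0.8, "vpn_gateway": 0.5, "historian": 0.5, "ecc": 0.4}
order = ["web_server", "vpn_gateway", "historian", "ecc"]

likelihood = compromise_likelihood(parents, p_exploit, order)
print(round(likelihood["ecc"], 3))
```

The resulting ECC likelihood would then be paired with the impact measured in the transient stability study to obtain the risk of the circuit breaker attack.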
Advantages
The advantages are twofold. First, it is a complete end-to-end risk assessment
methodology. We do not just evaluate the impact of an attack on the physical system, but we
also consider the likelihood of an attack. The complete analysis helps power engineers
understand which factor contributes more to the risk. Accordingly, they take actions in the
cyber and physical domains to reduce risk. Second, the approach is repeatable. Power engineers
can use this approach to compute risk for various CPAs. The approach is not limited to
the smart grid domain; system engineers of other CPSs, such as water treatment plants,
nuclear plants, and oil and gas systems, can also use it to perform a risk assessment. We discuss
the end-to-end risk assessment approach in detail in Chapter 4.
1.4.2 Risk Mitigation
In the previous sub-section, we presented an approach to compute the risk of a CPA
using two critical factors: the likelihood of an attack and its impact. The risk value represents
how vulnerable the system is and which factor is contributing more to the risk. Is it the
likelihood of an attack or its impact? Once system engineers understand the computed risk,
they make decisions to reduce risk in the following ways: 1) reduce the likelihood of an attack
(avoidance phase), 2) reduce the impact of an attack even when a cyber system is compromised
(containment phase) and 3) take optimal actions in the cyber and physical domain to mitigate
attacks at minimum operating cost (recovery phase).
The likelihood of an attack is reduced by updating the system with a new version of the
software so that the vulnerabilities are patched. The question is which system to choose in a
particular state for a software update. The challenge is to decide which system to scan (to
determine whether it is vulnerable) or patch (update software to remove known vulnerabilities)
because scanning and patching system components in critical infrastructures are complex and
time-consuming tasks. Moreover, engineers have a limited budget in terms of the number of systems
they can scan or patch in a given period. To perform either of these actions, system engineers
must place a function in standby mode, and it is not possible to suspend several
functions simultaneously. Furthermore, they know the internal network but not the position
where an attack might have occurred until the symptoms of that attack are observed. The
importance of systems also plays a significant role in deciding which system to patch. There is
a need to develop a tool that incorporates all these factors and performs this task efficiently. In
Chapter 5, sub-section 5.1, we discuss a Reinforcement Learning based methodology to answer
this question.
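To make the patch-selection problem concrete, the following is a minimal sketch of tabular Q-learning on a toy version of it. It is not the formulation of Chapter 5: the state (which systems are still vulnerable), the reward (negative total importance of systems left vulnerable), the importance weights, and all hyperparameters are assumptions chosen only for illustration, and scanning is omitted.

```python
import random

IMPORTANCE = (5.0, 1.0)  # hypothetical importance weights of two systems


def step(state, action):
    """Patch one system; the reward penalizes systems that remain vulnerable."""
    nxt = tuple(False if i == action else v for i, v in enumerate(state))
    reward = -sum(w for w, vulnerable in zip(IMPORTANCE, nxt) if vulnerable)
    return nxt, reward, not any(nxt)


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    n = len(IMPORTANCE)
    q = {}  # state -> list of action values
    for _ in range(episodes):
        state = (True,) * n  # every system starts vulnerable
        for _ in range(n):  # budget: one patch window per system
            values = q.setdefault(state, [0.0] * n)
            if rng.random() < epsilon:
                action = rng.randrange(n)  # explore
            else:
                action = max(range(n), key=values.__getitem__)  # exploit
            nxt, reward, done = step(state, action)
            future = 0.0 if done else max(q.setdefault(nxt, [0.0] * n))
            values[action] += alpha * (reward + gamma * future - values[action])
            state = nxt
            if done:
                break
    return q


q_table = train()
start = (True, True)
first_patch = max(range(len(IMPORTANCE)), key=q_table[start].__getitem__)
print("patch first:", first_patch)
```

In this toy setting the learned policy patches the more important system first; a realistic formulation would also have to account for scanning, standby constraints, and uncertainty about where an attack has occurred.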
Another way to reduce risk is to reduce the impact of an attack. We present two ways
to reduce the impact, one in the physical domain and another in the cyber domain. In the
physical domain, we show how replacing a generator (under attack) with a Photo Voltaic system
can reduce the impact on the system even when a cyber system is compromised. We discuss
the results of the simulation in Chapter 5, sub-section 5.2. In the cyber domain, we present a
methodology to design a policy server to verify whether commands issued from a compromised
cyber system are required and safe for the physical domain. We discuss this approach in
Chapter 5, sub-section 5.3.
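The idea of checking that a command is both required and safe can be sketched as follows. The rules used here are assumptions made for illustration and are not the policy design of Chapter 5: an "open breaker" command counts as required only if the generator has a reported fault or scheduled maintenance, and as safe only if the generation lost does not exceed the available reserve. All names and values are hypothetical.

```python
def allow_open_command(generator, gen_output_mw, reserve_mw, fault_or_maintenance):
    """Return True only if opening the generator's breaker is required and safe.

    gen_output_mw:        generator -> current output in MW
    reserve_mw:           reserve available to cover a lost generator
    fault_or_maintenance: set of generators with a legitimate reason to disconnect
    """
    required = generator in fault_or_maintenance
    # Unknown generators are treated as unsafe by assigning them infinite output.
    safe = gen_output_mw.get(generator, float("inf")) <= reserve_mw
    return required and safe


# Hypothetical grid state seen by the policy server.
gen_output_mw = {"G1": 150.0, "G2": 400.0}
reserve_mw = 200.0

print(allow_open_command("G1", gen_output_mw, reserve_mw, {"G1"}))  # required and safe
print(allow_open_command("G2", gen_output_mw, reserve_mw, {"G2"}))  # required, not safe
print(allow_open_command("G1", gen_output_mw, reserve_mw, set()))   # safe, not required
```

Because the check runs outside the higher-level function that issues the command, a compromised control center cannot bypass it simply by sending a well-formed command.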
To make ICS resilient against real-world adversaries, system engineers must allocate
cyber and physical resources at minimum operating cost. An adversary performs a variety of
attacks such as topology attacks, integrity attacks, and hijacking attacks on the cyber and
physical infrastructure of the grid. The system admin does not know where and what kind of attacks
are performed by an adversary, so she does not know the real state of the system. She receives
observations about the change in the system infrastructure, such as whether a node or link is
Active or Inactive, whether some nodes are malicious or not, etc. The observations are received
with some uncertainty from an intrusion detection system installed in the environment. It is
crucial to concentrate on what actions the admin should take to meet power demand at a
minimum operating cost in the presence of CPAs by an adversary on the grid's power and
information infrastructure. The admin should be able to decide which actions to perform, such as whether
to decrease or increase power reserve, perform load curtailment or load shedding, or repair a node,
in order to minimize the cost of operation and maintain power system
stability. To answer this question, we present a theoretical framework, based on a Partially
Observable Markov Decision Process (POMDP), for formulating the above problem and
provide experimental results to support our claim using a simplified scenario in which an optimal
policy is computed efficiently using a POMDP solver. We discuss this approach in Chapter 6.
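Because the admin only receives uncertain observations, a POMDP maintains a belief over the hidden state. The standard Bayesian belief update can be sketched as follows for a single node; the two states, the transition model for one fixed action, and the intrusion detection accuracy are hypothetical values and not those used in Chapter 6.

```python
def belief_update(belief, transition, observation_prob, obs):
    """One POMDP belief update for a fixed action.

    belief:           state -> probability before the step
    transition:       state -> {next_state: probability} under the chosen action
    observation_prob: next_state -> {observation: probability}
    obs:              the observation actually received
    """
    states = list(belief)
    # Prediction: push the belief through the transition model.
    predicted = {
        s2: sum(transition[s][s2] * belief[s] for s in states) for s2 in states
    }
    # Correction: weight each state by how well it explains the observation.
    unnormalized = {s: observation_prob[s][obs] * predicted[s] for s in states}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical node model while the admin only monitors.
belief = {"ok": 0.9, "compromised": 0.1}
transition = {
    "ok": {"ok": 0.95, "compromised": 0.05},
    "compromised": {"ok": 0.0, "compromised": 1.0},
}
# Imperfect intrusion detection: false alarms and missed detections both occur.
observation_prob = {
    "ok": {"alert": 0.1, "quiet": 0.9},
    "compromised": {"alert": 0.8, "quiet": 0.2},
}

posterior = belief_update(belief, transition, observation_prob, "alert")
print({s: round(p, 3) for s, p in posterior.items()})
```

A POMDP solver chooses actions based on such beliefs rather than on the true state, which is what allows a policy to trade off repair cost against the chance that an alert is a false alarm.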
Real-world attacks on the oil and gas pipeline system are frequent [7] [8] [9] [10]
[11]. The pipeline’s cyber domain controls and monitors the functioning of the physical
components such as oil pipeline segments, compressor stations, sucker rod pumps, etc. These
physical components require frequent patrolling to avoid physical attacks. Cyber attackers want
to compromise cyber targets to gain unauthorized privileges to control the functionality of the
physical processes. The physical attacker wants to attack oil pipelines to steal as much oil as
possible [9] [10]. To maintain the overall system protection/resilience, it is essential to protect:
1) cyber components from cyber attacks so that system engineers can take appropriate actions
during an abnormal state and maintain the state information, and 2) the physical infrastructure
since any attack on it has a direct impact on the system’s resilience.
The system administrator of the oil pipeline is responsible for allocating cyber
controls to protect cyber components and patrol teams to patrol different pipeline
locations. Due to a limited budget, she prudently deploys resources to maximize the protection
of the system. To achieve efficient resource allocation for the administrator,
we propose a Stackelberg Security Game (SSG) with three players:
the system administrator of the oil pipeline as the leader (defender), and the cyber attacker and the
physical attacker as followers. The novelty of this approach lies in formulating 1) a real-world
problem of oil stealing and 2) a game in which two different types of targets are
attacked by two different types of adversaries with distinct motives who can coordinate
to maximize their rewards. The solution to this game helps oil pipeline engineers choose
the cyber security controls for the cyber targets and to allocate patrol teams to the pipeline
regions efficiently. We provide a theoretical framework for formulating and solving the above
problem in Chapter 7.
1.5 Thesis Contributions and Outline
The primary motive of this thesis is to show how to defend ICSs, such as the smart grid,
from CPAs by performing end-to-end risk management. We perform risk assessment and
show various ways to allocate resources to maintain smart grid resilience (reduce risk). We
answer the thesis question through the following contributions:
1. We present an Attack-Defense approach to evaluate the resilience of the grid in the
presence of multiple cyber-physical attacks [27] [28]. The key idea is to broaden the
area considered in the risk analysis. We include distinct functional components of the
grid (responsible for the grid stability) such as AMI, gas pipeline system [29] [30], and
DERs while performing risk analysis. Through simulations, we demonstrate how
attacks on various functional components of the grid affect its resilience.
2. We present an end-to-end risk assessment methodology to quantify risk for a CPA on
the grid. This contribution focuses on the avoidance stage of resilience.
a. We show how to model the cyber domain of the grid as the Bayesian Network
to compute the likelihood of an attack [31].
b. We perform contingency analysis to quantify the impact of manipulating circuit
breakers [32].
c. We combine the likelihood and impact of manipulating circuit breakers to
compute risk [32] [33].
d. We discuss what we mean by risk and how it is useful to power engineers.
3. We present three risk mitigation approaches:
a. A reinforcement learning-based methodology to decide which system to scan or
patch in a particular system state [34]. This contribution focuses on the
avoidance stage of resilience.
b. We demonstrate how replacing a generator with a Photo Voltaic unit makes the
grid resilient to circuit breaker manipulation attacks [32]. This contribution
focuses on the containment stage of resilience.
c. We present a policy server to prevent malicious demand response commands
(from a compromised cyber system) from propagating to the physical domain
[35]. This contribution focuses on the containment stage of resilience.
4. We present a theoretical Power Storage Protection (PSP) framework
against a fixed opponent (adversary) [36]. Our main motive is to mitigate attacks at
minimum operating cost and maintain the grid’s resilience. We fix the strategy for the
adversary and model the problem as a POMDP from the perspective of the power utility
and solve it using a POMDP solver. We provide experimental results to support our
claim using a simplified PSP scenario in which the optimal policy is computed. This
contribution focuses on the recovery stage of resilience.
5. We present a theoretical framework for formulating and solving the oil stealing problem,
where cyber and physical attackers may coordinate and attack oil pipeline system
components to maximize their profit [37]. We propose an SSG of three players: the system
administrator of the oil pipeline as the leader (defender), and the cyber attacker and the
physical attacker as followers. In this game, two different types of targets are
attacked by two different types of adversaries who have distinct motives and can
coordinate to maximize their rewards.
Overall, in this dissertation, we present a complete risk management methodology for
the smart grid system and show its usefulness in the real world. The approaches discussed in
this dissertation can be applied to other CPSs, and we recommend that system engineers deploy
them to understand the security status of the system and use the results of the risk assessment
to mitigate cyber-physical risk.
Thesis Outline
The rest of the dissertation is organized as follows: Chapter 2 discusses the related work
and problems unaddressed by the literature. Chapter 3 presents an Attack-Defense approach to
evaluate the resilience of the grid in the presence of multiple cyber-physical attacks and defender
actions. In this scenario, an adversary performs three different types of attacks on three
different functional components of the grid. Through contingency simulations in the
PowerWorld simulator, we analyze the impact of those attacks.
Chapter 4 describes the risk assessment methodology to evaluate the risk of a cyber-
physical attack. It explains how to model the cyber domain of the grid to compute the likelihood
of an attack and combine its results with the impact analysis to compute risk. Moreover, we
discuss what we mean by risk and how it is useful to power engineers.
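As a minimal illustration of combining likelihood with impact, the sketch below multiplies a hypothetical attack likelihood by a hypothetical impact (load lost, in MW) for each circuit breaker and ranks the breakers by the result. The product form, the breaker names, and all numbers are assumptions made for illustration; Chapter 4 defines the actual computation.

```python
def risk_scores(likelihood, impact):
    """Return {breaker: likelihood * impact}, assuming risk is their product."""
    return {breaker: likelihood[breaker] * impact[breaker] for breaker in likelihood}


# Hypothetical inputs: probability of a successful manipulation and load lost (MW).
likelihood = {"CB1": 0.10, "CB2": 0.40, "CB3": 0.05}
impact = {"CB1": 300.0, "CB2": 50.0, "CB3": 500.0}

scores = risk_scores(likelihood, impact)
# Breakers ordered from highest to lowest risk score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A ranking of this kind is one way a power engineer could decide which breaker to protect first under a limited budget.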
Chapter 5 describes three risk mitigation approaches: first, a reinforcement
learning-based approach to decide which cyber system to scan or patch in the grid’s cyber
domain; second, a way to mitigate the impact of manipulating circuit breakers in the physical
domain; and finally, a policy server in the cyber domain that prevents malicious demand response
commands from propagating to the physical grid.
Chapters 6 and 7 present approaches to solve two real-world problems: 1) what actions
the administrator should take to stabilize the grid at minimum cost, and 2) how to prevent
coordinated cyber and physical attacks on the oil pipeline system that aim to steal oil, respectively.
Chapter 8 describes various real-world attack scenarios and blackouts. We show how
our research is useful in understanding those scenarios. Chapter 9 discusses the dissertation.
Chapter 10 concludes with a summary of the work and potential future work.
Chapter 2
Related Work
Over the past decade, much work has focused on resilience modeling, attack
detection, analysis of the impact of CPAs, and efficient allocation of resources in the smart
grid using a game-theoretic approach. Almost every work we investigated starts by defining
the functionality of the grid, how it can be compromised in the cyber domain, how a cyber attack
propagates to the physical domain and destabilizes the underlying grid, and how to allocate
resources in either the cyber or physical domain to protect against real-world adversaries.
Finally, these works discuss which variables (or parameters) to consider in order to build robust
algorithms that maintain resilience. Work targeting smart grid security can be classified
by research objective: Risk Assessment, and Resource Allocation to
mitigate risk. Researchers perform risk assessment either by evaluating the resilience of the
system in the presence of CPAs or by quantifying the impact of an attack. In order to mitigate risk,
power engineers allocate resources efficiently in the cyber and physical domains. In this
chapter, we classify the related work according to Figure 2.1, which provides a high-level
overview of research objectives with the system under consideration, what attacks are
analyzed, approaches for resource allocation and how analysis is performed. We divide the
survey into two sections: Risk Assessment and Resource Allocation.
2.1 Risk Assessment
The first study that describes an efficient way to investigate the impact of cyber attacks
on the power system is provided by Stamp et al. [38] (2009). The authors presented the idea of
Figure 2.1: High level overview of related work objectives and approaches.
developing a cyber-to-physical bridge that links cyber attack vectors to the resulting events in
the power system. The cyber-to-physical bridge has four components: 1) attack vectors, 2)
outcome of attacks, 3) special effect in the grid, and finally 4) grid impacts. The authors adopted
a quantitative approach without data to estimate grid performance degradation in the presence
of cyber attacks and quantify the reliability. Through Monte-Carlo simulations, the security
metrics that describe the impact of cyber attacks on the power system are computed. The reliability
is measured using indices such as Frequency of Interruption, Loss of Load Expectancy, etc.
This approach provides insights into system resilience in the presence of cyber attacks, but
does not consider the dynamic nature of attack vectors and does not explain how such attacks
are performed.
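To make the idea of estimating reliability indices by Monte-Carlo simulation concrete, the following is a minimal sketch under stated assumptions. It is not the model of Stamp et al. [38]: the daily attack probability, the probability that an attack causes an outage, the load, and the outage duration are all hypothetical parameters, and unserved energy is used here only as a simple stand-in for a load-based index.

```python
import random


def simulate_year(rng, p_attack_per_day, p_outage_given_attack, load_mw, outage_hours):
    """Simulate one year; return (number of interruptions, unserved energy in MWh)."""
    interruptions = 0
    unserved_mwh = 0.0
    for _ in range(365):
        # An interruption needs an attack that day AND the attack causing an outage.
        if rng.random() < p_attack_per_day and rng.random() < p_outage_given_attack:
            interruptions += 1
            unserved_mwh += load_mw * outage_hours
    return interruptions, unserved_mwh


def monte_carlo(trials, seed=0, **params):
    """Average the yearly indices over many simulated years."""
    rng = random.Random(seed)
    total_interruptions = 0
    total_energy = 0.0
    for _ in range(trials):
        count, energy = simulate_year(rng, **params)
        total_interruptions += count
        total_energy += energy
    return total_interruptions / trials, total_energy / trials


# Hypothetical parameters, chosen only for illustration.
freq, energy = monte_carlo(
    2000,
    p_attack_per_day=0.01,
    p_outage_given_attack=0.2,
    load_mw=100.0,
    outage_hours=4.0,
)
print(f"interruptions/year ~ {freq:.2f}, unserved energy ~ {energy:.0f} MWh/year")
```

With these assumed parameters the expected frequency is 365 x 0.01 x 0.2 = 0.73 interruptions per year, so the simulated estimate should land close to that value.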
Neuman and Tan [3] (2011) adopted a qualitative approach to model CPAs on the smart
grid and described how threats propagate from one region to another. The paper presents new
classes of threats in CPSs (specifically in the grid) that are useful in characterizing the
interactions between domains. An attack in the cyber domain affects the physical infrastructure
and vice versa. The attack classes are: 1) cyber-cyber, 2) cyber-physical, 3) physical-physical,
4) physical-cyber, and 5) cyber-physical-cyber. The paper also states that the smart grid is a
federated system, and there are components which cannot be trusted. Power engineers should
create multiple protection domains, manage the flow of information across different functional
boundaries, and model domains as cyber-physical domains.
Cardenas et al. [39] (2011) demonstrated an integrity attack on a chemical reactor and
showed how cyber attacks could affect the functionality of the CPS. They emphasized that the
interactions between the cyber domain and control system (physical domain) should be
considered to develop more resilient systems and detection algorithms. Through simulations,
they perform a risk assessment of the chemical reactor in the presence of integrity attacks.
Furthermore, they describe various attack detection methods and also design a response to
attacks to maintain system resilience. The paper provides a better understanding of the cyber
attacks on the CPS and covers different aspects of security (prevention, detection, and
response).
Liu et al. [40] (2011) presented coordinated switching attacks to evaluate the grid’s
resilience. The authors assumed that attackers gain control of the circuit breaker that connects
a portion of the load in the power system and keep switching this circuit breaker until
the system is destabilized. Through power simulations, the paper demonstrated the
consequences of the attack on the power grid (using the IEEE 9-bus model) in terms of frequency
and voltage violations. The limitation of this approach is that the authors do not explain the
attack path, that is, how an attacker compromises the circuit breaker; modeling such an action
requires considering the communication network.
Sridhar et al. [14] [41] (2012) emphasized that power system operations
depend on the cyberinfrastructure for control and monitoring purposes. The use of
technology broadens the attack surface and exposes the system’s critical applications.
The authors provided a qualitative study of a wide variety of attacks on the power system, such as
integrity attacks, replay attacks, and DDoS attacks, and of their consequences. In
[41], the authors provided a risk assessment methodology, which is then used to specify details
of generation, transmission, and distribution regions of the power system. The qualitative
approach in this paper demonstrated how important it is to combine the power application
security and cyberinfrastructure that supports power operations into a single risk assessment
methodology. Although the paper emphasized that it is essential to consider both
cybersecurity and power security while assessing risk, it does not discuss how to model the
system to perform such analysis.
Srivastava et al. [42] (2013) used a graph-theoretic approach to analyze the
vulnerability of the electric grid with incomplete information. The paper provided details of
the vulnerabilities associated with the ICT and different types of attacks in the context of the
grid. For instance, the Aurora attack happens by penetrating a communication link from the
control center to generation, transmission relay, or Programmable Logic Controllers (PLCs).
Then, attackers inject false data to close generators. Through simulations on IEEE 14-bus and
IEEE 118-bus, the paper demonstrated the impact of integrated CPAs on the grid.
AlMajali et al. [21] [24] [43] (2013, 2015) showed the consequences of the load drop
attack on the smart grid by defining the resilience when end nodes of the grid are compromised.
Through the combination of the PowerWorld and Network Simulator simulations, the authors
demonstrated how load drop attacks implemented over the AMI destabilize the underlying
power grid. The Time to Criticality (TTC) parameter is identified via simulations and should be
considered while developing more resilient algorithms. In [43], the authors discussed how
Demand Response functionality could be used as a spinning reserve to satisfy peak demand.
Moreover, they demonstrate how a DDoS attack on the AMI prevents legitimate demand
response commands from reaching endpoints (smart meters), which further prevents load shedding
and curtailment in an area. Since demand is high and the system is not able to curtail load using
demand response functionality, the grid destabilizes due to the under-frequency protection
threshold mechanism.
Dynamic Load Altering Attacks (DLAA) are proposed in [22] (2015). The primary
motive of the attacker is to change the load to affect the resilience of the grid. The focus is not
just on the volume of load altered but also on the time over which it is changed. The authors formulated
the problem using linear power equations and presented two types of DLAA: open loop and
closed loop. A similar kind of attack is shown in [21]. The authors in [21] performed a load drop
attack by compromising smart meters and using them to send the remote disconnect command.
Then, power is decreased over a period in PowerWorld Simulator to analyze the impact.
Therefore, the work in [22] shows results similar to [21], but its unique part is that it considers
the open and closed loop concept.
Tan et al. [15] (2016) computed the optimal attack scenario using false data injection
attacks on the Automatic Generation Control (AGC). Cyber attacks on the sensor
measurements for AGC cause the area frequency to go beyond thresholds and trigger remedial
actions such as islanding, load shedding, and partial or complete system shutdown. The paper
provides an attack impact model that can demonstrate false data injections in the sensor
network and analyzes their consequences. Through simulations in PowerWorld on the
three-area 37-bus model and experiments on a physical 16-bus power system testbed, the paper
demonstrated the relevance of its optimal attack sequence and attack detection algorithms.
Pan et al. [44] (2016) introduced combined data integrity and availability attacks to
expand the attack scenarios against the power system state estimation. The primary motive of
the attacker is to manipulate the state estimator parameters to reduce the awareness of the
system while remaining hidden from the defender. The authors develop a power system and
state estimation model, a communication model, and stealth attacks (which remain hidden). The
authors formulated a Mixed Integer Linear Program (MILP) to quantify the vulnerabilities of the
electricity grid to combined data attacks. Furthermore, they provide mitigation schemes against
such attacks.
A risk assessment approach for the power system that considers the reliability of the
information system is presented by Lu et al. [45] (2016). Through simulations, the authors
demonstrated that the failure of the information system brings more risk to the power system
operation. The risk of line overload and the risk of bus voltage being out of range are defined and
calculated as risk indices. In this study, both software and hardware components are considered to quantify
the indices. The paper provided detailed insights into the impact of the information system
failure on the smart grid system, which was not performed earlier in the literature.
Panteli et al. [46] (2017) presented the resilience quantification framework for the
power system in the presence of extreme weather conditions. The operational resilience of the
system refers to the characteristics that secure the operational strength of the system, and the
infrastructure resilience relates to the physical strength that mitigates the damaged portion of
the system. The paper provided details about the various phases of the resilience trapezoid
(disturbance, post-disturbance, and restorative) and specifies resilience metrics corresponding
to each one of them. The amounts of generation and demand are used as indicators of
operational resilience, whereas the number of online transmission lines is used as an indicator
of infrastructure resilience. The authors then used the concept of a fragility curve to model the
transmission network in the presence of extreme weather events. Finally, the Great Britain
transmission network is used as a test network to apply the proposed framework. This paper
provides valuable insights into modeling CPSs in the presence of weather conditions.
Alohali et al. [47] (2017) performed a packet replay attack against the
authentication scheme over the AMI network. Like DDoS attacks, packet replay attacks drain
the resources of the system. The authentication scheme implemented in
this paper uses a multi-hop path to reach the authentication server. The intermediate nodes
partially process each packet before forwarding it, and therefore, there is an end-to-end delay
and energy consumption on the server. The results of the study show that there is an increase
in authentication time due to replay attacks.
Anas et al. [17] (2016) modeled the propagation of a worm in the AMI network.
Through simulations in the Network Simulator, two parameters were identified that affect how
fast the worm propagates within the AMI: the transmission range of the meters and the size
of the worm. They used the Weibull distribution to model the
time needed for a smart meter in the AMI to transition from the normal to the malicious state.
Finally, the paper discussed what physical actions the worm could perform once it is installed
in smart meters, and their consequences.
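The Weibull model of transition times can be sketched as follows. This is an illustrative sample-based sketch, not the simulation of [17]: the scale and shape values are hypothetical, and in a fuller model they would be fitted to quantities such as the meter transmission range and the worm size.

```python
import random


def infection_times(n_meters, scale_s, shape, seed=0):
    """Sample the time (seconds) for each meter to move from normal to malicious.

    random.weibullvariate(alpha, beta) takes the scale first and the shape second.
    """
    rng = random.Random(seed)
    return sorted(rng.weibullvariate(scale_s, shape) for _ in range(n_meters))


def fraction_infected(times, t):
    """Fraction of meters whose transition time is at most t."""
    return sum(1 for x in times if x <= t) / len(times)


# Hypothetical parameters: scale of 60 s and shape of 1.5.
times = infection_times(5000, scale_s=60.0, shape=1.5)
for t in (30.0, 60.0, 120.0):
    print(t, round(fraction_infected(times, t), 3))
```

For a Weibull distribution, the fraction of meters infected by a time equal to the scale parameter is 1 - exp(-1), about 63%, regardless of the shape, which gives a quick sanity check on the samples.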
Dabrowski et al. [48] (2018) studied the effect of power demand increases caused by
remotely activating CPUs, GPUs, hard disks, screen brightness, and printers on the frequency
of the European power grid. They demonstrated that by controlling these functionalities, an
adversary can manipulate the power demand and destabilize the underlying grid. Similarly, Soltan
et al. [49] (2018) demonstrated how an adversary can manipulate the power demand by controlling
high wattage IoT devices via a botnet. Both papers focused on analyzing the impact of attacks
on the grid using a quantitative approach.
Modeling and simulation [50] (2004) of interdependencies among critical infrastructures can provide
insights into the complex nature of their functions, behaviors, and operational characteristics.
The critical infrastructures are interdependent in four ways: physical, geographical, cyber, and
logical. Such distinctions demonstrate that a change in one can bring about a change in another.
Shahidehpour et al. [51] (2005) analyzed the short-term impact of the natural gas on power
generation scheduling by formulating an optimization problem that tends to minimize the
power generation cost.
Manshadi and Mohammad [52] (2015) proposed a methodology to identify the
vulnerable components within microgrid infrastructure (consisting of natural gas, smart grid, and
heating systems) and how many disruptions can affect the resilience of such multi-energy
systems. The paper does not describe how such disruptions can happen. The authors assumed
that there is an increase in cost due to disruptions and formulated a cost optimization problem
to see the effect of disruption cost. The requirement is to define the attack path within the
system of systems so that system administrators can either block the path or isolate the system
so that the attack does not propagate to other parts of the systems.
Tao et al. [53] (2008) demonstrated attacks by performing the outage of pipelines or
electricity lines without defining how an attacker accomplishes those outages. Yuan et al. [54]
(2014) proposed a methodology to identify and protect vulnerable components of connected
gas and electric infrastructures from malicious attacks. The authors have formulated the tri-
level optimization problem (defender-attacker-defender), which minimizes the cost of
disruption and damage.
Erdener et al. [55] (2014) described an integrated simulation model for analyzing
electricity and gas systems. They have assumed random failures in the power network and
examined the effect of that failure on the gas system and vice versa. The problem with their
approach is that they have assumed random failures of components, which provides insight into
the interdependency of the networks but fails to describe how those failures arise. The current
literature does not provide details about smart grid resilience in the presence of multiple
CPAs on its various functions and components. We need to incorporate direct and indirect
attacks on various functions of the grid in our attack model.
Laprie et al. [56] (2007) described a qualitative model for modeling the
interdependencies between the power network and information infrastructure. The authors
have modeled the behavior of interdependent infrastructures by analyzing the impact of the
cascading, escalating, and malicious attacks. The main idea is to formulate the natural gas
network that is connected to the power grid and demonstrate how energy transformation occurs
at the combined nodes. Similarly, in [89] (2012), the authors presented cascading failure in the
power grid under various attack strategies.
To capture the interdependencies between oil pipelines and power networks, a graph-theoretic
approach is presented by Wu et al. [57] (2016). Physical and
functional interdependencies connect both networks. The attacks are performed on the
network by attacking a small fraction of nodes and analyzing the effect on the connected
network. These sources do not address how nodes in the oil and power network fail. The paper
provides relevant insights about the natural gas and power generation interconnection, but the
question remains how disruptions occur and why they are essential.
Xiang et al. [58] used a probabilistic model to study the effect of cyber attacks on circuit
breakers. This model takes into consideration different factors that influence any attack, such
as the vulnerabilities in the system, the required attack steps and resources of the attacker. By
quantifying those factors, the reliability of a test system was measured.
Rahman et al. [59] proposed an algorithm to detect faults and cyber-attacks and
distinguish between them. The algorithm uses the circuit breaker status and current flow
measurements to determine if there is an attack or only a fault in the system. This kind of
analysis can be used to detect load altering attacks described by Amini et al. [22].
Kang et al. [60] implemented a man-in-the-middle (MITM) attack on a modeled power
system that utilizes the IEC 61850 standard. The modeled power system includes a Photo
Voltaic (PV) system. The attacker manipulates the configuration of the PV system (e.g., power
limits) by performing a MITM attack between the SCADA system and the PV system forcing
the PV inverter to switch off.
Liu et al. [61] (2017) proposed a risk assessment method for evaluating the cyber-
security of a microgrid that has a solar PV control system. The system was modeled using a
Markov process where it can be in one of three primary states. Transitions between states occur
at different rates (i.e., failure and repair rates), which are estimated using attack parameters like
attack duration and detection time. Finally, the attacks are quantified using the economic cost
of the attack. The main drawback is the difficulty of estimating the attack parameters.
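As a rough illustration of how failure and repair rates translate into the time a system spends in each state, the sketch below computes the steady state of a three-state cyclic Markov process. The state names, the cyclic structure, and the rates are hypothetical and are not taken from [61]; they only show the kind of calculation such a model supports.

```python
def cycle_steady_state(rates):
    """Steady state of a cyclic continuous-time Markov chain.

    rates maps each state to the rate (per hour) of leaving it for the next
    state in the cycle. In a pure cycle, the balance equations give
    pi[state] * rate[state] = constant, so the long-run fraction of time in a
    state is proportional to its mean sojourn time, 1 / rate.
    """
    sojourn = {state: 1.0 / rate for state, rate in rates.items()}
    total = sum(sojourn.values())
    return {state: time / total for state, time in sojourn.items()}


# Hypothetical rates: attack success, detection, and repair (all per hour).
rates = {"normal": 0.01, "compromised": 0.5, "repair": 0.1}
pi = cycle_steady_state(rates)
print({state: round(p, 4) for state, p in pi.items()})
```

Multiplying the fraction of time spent in the compromised state by an assumed cost per hour would give one simple economic measure, which also shows why the result is only as good as the estimated rates.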
The concept of defining resilience using a Bayesian Network (BN) has frequently been
used in securing engineered systems, such as in [62][63][64]. Li et al. [62] (2016) described a
three-layer framework that assesses the potential risks that apps introduce within the Android
mobile system. They formulated three layers, called the static, dynamic, and behavioral analysis
networks, where risks are identified and their propagation through each layer is modeled as a
Bayesian Risk Graph (BRG). Hosseini et al. [63] (2016) defined a BN framework to
compute the resilience of the inland waterway system. Poolsappasit et al. [64] (2012)
modeled IT infrastructure using a BN that enables the system administrator to quantify the chances of
network compromise.
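A small example helps show how a BN over an attack graph yields a compromise likelihood. The sketch below propagates probabilities through a tiny, hypothetical attack graph with a noisy-OR rule, assuming the parents of each node are compromised independently. The node names and exploit probabilities are invented for illustration and do not come from [64].

```python
def compromise_probabilities(edges, order, entry):
    """Propagate compromise probabilities through an attack graph.

    edges maps a node to a list of (parent, p_exploit) pairs, and order is a
    topological order of the nodes. Each node uses a noisy-OR rule: it stays
    safe only if every incoming exploit attempt fails. Parents are assumed to
    be compromised independently of each other.
    """
    prob = {entry: 1.0}  # the attacker is assumed to control the entry point
    for node in order:
        if node == entry:
            continue
        p_safe = 1.0
        for parent, p_exploit in edges.get(node, []):
            p_safe *= 1.0 - p_exploit * prob[parent]
        prob[node] = 1.0 - p_safe
    return prob


# Hypothetical attack graph: two routes from the internet to a SCADA host.
edges = {
    "web_server": [("internet", 0.6)],
    "vpn_gateway": [("internet", 0.3)],
    "scada_host": [("web_server", 0.5), ("vpn_gateway", 0.4)],
}
order = ["internet", "web_server", "vpn_gateway", "scada_host"]
prob = compromise_probabilities(edges, order, "internet")
print(round(prob["scada_host"], 3))
```

In this example the SCADA host is compromised with probability 1 - (1 - 0.5 x 0.6)(1 - 0.4 x 0.3) = 0.384; lowering either exploit probability, for instance by patching, reduces that value directly.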
2.2 Resource Allocation
A DDoS attack on the remote monitoring smart meters of the power grid, described by
Husheng et al. [65] (2012), can reduce situational awareness so that the system administrator
cannot respond to attacks immediately. The authors formulated the jamming and anti-jamming
game as a zero-sum stochastic game. When remote sensors are jammed, the state information
is not delivered to the control system. Sensors can use multiple other channels to avoid
jamming interference. The attacker’s action is to decide which sensors to
jam. The Nash Equilibrium is computed to demonstrate the increased rewards when
effective anti-jamming signals are sent. In such attack scenarios, the primary questions
addressed using game theory are identifying the sensitive points in the grid and where to place
the security controls.
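To illustrate what computing a Nash Equilibrium means in a jamming setting, the sketch below solves a single-stage 2x2 zero-sum matrix game in closed form. This is a deliberate simplification: [65] formulates a stochastic (multi-stage) game, while here a sensor picks one of two channels once and a jammer picks one channel to jam. The payoff values are hypothetical.

```python
def solve_2x2_zero_sum(matrix):
    """Solve a 2x2 zero-sum game given the row player's payoffs.

    Returns (p, q, value): p is the probability that the row player picks
    row 0, q is the probability that the column player picks column 0, and
    value is the row player's expected payoff at equilibrium.
    """
    (a, b), (c, d) = matrix
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        # A pure-strategy saddle point exists.
        p = 1.0 if min(a, b) >= min(c, d) else 0.0
        q = 1.0 if max(a, c) <= max(b, d) else 0.0
        return p, q, float(maximin)
    # Otherwise each player mixes so that the opponent is indifferent.
    denom = a - b - c + d
    p = (d - c) / denom
    q = (d - b) / denom
    value = (a * d - b * c) / denom
    return p, q, value


# Rows: sensor transmits on channel 0 or 1. Columns: jammer jams channel 0 or 1.
# Hypothetical payoffs: a jammed transmission is worth 0, an unjammed one is
# worth 3 on channel 0 and 1 on channel 1.
payoffs = [[0.0, 3.0], [1.0, 0.0]]
p, q, value = solve_2x2_zero_sum(payoffs)
print(p, q, value)
```

Under these assumed payoffs the sensor uses the more valuable channel only a quarter of the time, because the jammer concentrates on it, which is the kind of insight an equilibrium analysis provides about where protection matters.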
Shelar and Amin [66] (2015) proposed a sequential Stackelberg Security Game (SSG)
to analyze the vulnerability of radial electricity distribution networks to disruption in DER
nodes. The primary motive of attackers is to remotely manipulate the set point of the inverter
and cause loss of voltage regulation. In response to this attack, the power utility (defender)
performs load curtailment to meet voltage and frequency requirements of the system. The
attacker wants to maximize the cost incurred to the defender by performing an attack. The
defender’s motive is to develop a strategy for optimal resource allocation to minimize the
impact of an attack. In this game, the attacker is the leader of the SSG, and the defender is the
follower. The equilibrium strategies are discovered, and the response of the defender against
an attacker is discussed. The paper provides an understanding of how optimal responses of the
players are computed in the context of ICS and how they help develop solutions to the
problems.
Srikantha and Kundur [67] (2016) formulated a differential game that demonstrated
stealthy strategies for attackers to disrupt transient stability by leveraging control over DER.
Using game-theoretic simulations, they showed that if the power utility identifies the
uncompromised components, it is possible to reduce the impact of attacks for a fixed interval
of time.
A Q-learning-based vulnerability analysis of the smart grid system is proposed by Yan et
al. [68] (2017) to model the sequential topology attacks. The topology attacks can create
significant disturbances by manipulating the smart grid connectivity. The attackers manipulate
the system control commands and sensor measurements. The primary motive is to come up
with an action sequence for an attacker using the Q-learning algorithm that can produce a
critical system failure where a fatal fraction of lines is out of service. The algorithm provides
the smallest number of lines that should be attacked to maximize the rewards for an attacker.
Through simulations, the algorithm finds the vulnerable sequences that lead to critical
blackouts in the power system. The paper helps system administrators identify critical
components in the attack schemes.
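The idea of learning a short attack sequence with Q-learning can be sketched on a toy problem. The example below is not the model of [68]: it assumes a hypothetical four-line system in which a blackout occurs once two specific lines are out, charges a cost of 1 per tripped line, and rewards a blackout with 10. All of these values are illustrative assumptions.

```python
import random

N_LINES = 4
CRITICAL = frozenset({1, 3})  # hypothetical: losing both of these lines causes a blackout


def step(state, action):
    """Trip one line; return (next state, reward, whether a blackout occurred)."""
    nxt = state | {action}
    done = CRITICAL <= nxt
    reward = -1.0 + (10.0 if done else 0.0)
    return nxt, reward, done


def train(episodes=2000, alpha=0.5, gamma=0.95, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state, done = frozenset(), False
        while not done:
            actions = [a for a in range(N_LINES) if a not in state]
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))
            nxt, reward, done = step(state, action)
            future = 0.0
            if not done:
                future = max(q.get((nxt, a), 0.0) for a in range(N_LINES) if a not in nxt)
            old = q.get((state, action), 0.0)
            # Standard Q-learning update toward reward plus discounted best future value.
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q


def greedy_sequence(q):
    """Follow the learned greedy policy from the intact system until blackout."""
    state, done, sequence = frozenset(), False, []
    while not done:
        actions = [a for a in range(N_LINES) if a not in state]
        action = max(actions, key=lambda a: q.get((state, a), 0.0))
        state, _, done = step(state, action)
        sequence.append(action)
    return sequence


print(greedy_sequence(train()))
```

In this toy setting the learned sequence trips only the two critical lines, which mirrors how such an analysis can point administrators to the components an attacker would target first.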
The papers described above combine ICS processes and game-theoretic algorithms
to model different problems in the field, such as DER compromise and the allocation of security
resources to minimize the attack impact and the cost of deployment. Such models are specific to the
problems discussed in each paper and do not generalize to other problems in the field.
Therefore, there is an opportunity to develop resilience models that can cover a family of
challenges in the area of cyber-physical systems in general.
SSGs have been employed in various fields and domains, such as in
[69][70][71][72][73][74]. Jain et al. [71] (2008) described the Bayesian SSG, which is
formulated to optimize the defender’s strategy under the uncertainty of different types of
adversaries. The authors captured the interaction between the defender and attackers in the
form of a Mixed Integer Quadratic Program (MIQP). They provided the steps to convert the
MIQP into an MILP and demonstrated the relevance of the algorithm by deploying it at LAX
airport. The problem with the above algorithm is that the size of the SSG state space grows
with the size of the input. Tsai et al. [72] (2009) presented ERASER, the fastest known
algorithm for solving the class of SSGs where the state space is huge. The authors used an
efficient MILP representation for modeling the SSG to overcome the runtime issues caused by
the problem mentioned above. Furthermore, they modeled the game with the defender’s actions
in a way that allows them to resolve the scheduling constraints issue without combinatorially
exploding the state space of actions.
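To make the leader-follower structure of an SSG concrete, the sketch below finds the defender's coverage for a game with two targets and one divisible resource by brute-force grid search. It is not ERASER and uses no MILP; the payoffs are hypothetical, and the attacker is assumed to observe the coverage, best-respond, and break ties in the defender's favor (the usual strong Stackelberg assumption).

```python
def solve_ssg(targets, steps=100, tol=1e-9):
    """Grid-search the defender's coverage split for a two-target security game.

    Each target is a dict with the defender's and attacker's payoffs when the
    attacked target is covered (def_cov, att_cov) or uncovered (def_unc, att_unc).
    Returns (defender utility, coverage vector, index of the attacked target).
    """
    best = None
    for i in range(steps + 1):
        coverage = [i / steps, 1.0 - i / steps]
        att = [c * t["att_cov"] + (1 - c) * t["att_unc"] for c, t in zip(coverage, targets)]
        dfn = [c * t["def_cov"] + (1 - c) * t["def_unc"] for c, t in zip(coverage, targets)]
        top = max(att)
        # The follower best-responds; ties are broken in the leader's favor.
        attacked = max((k for k in range(2) if att[k] >= top - tol), key=lambda k: dfn[k])
        if best is None or dfn[attacked] > best[0]:
            best = (dfn[attacked], coverage, attacked)
    return best


# Hypothetical payoffs for two targets.
TARGETS = [
    {"def_cov": 2.0, "def_unc": -8.0, "att_cov": -5.0, "att_unc": 5.0},
    {"def_cov": 1.0, "def_unc": -3.0, "att_cov": -2.0, "att_unc": 2.0},
]
utility, coverage, attacked = solve_ssg(TARGETS)
print(utility, coverage, attacked)
```

Grid search like this scales poorly as targets and resources grow, which is the state-space problem that compact MILP formulations are designed to avoid.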
So far, we discussed algorithms that concentrate on modeling the behavior of rational
adversaries and computing the resource allocation against them. Researchers have also worked
in modeling the bounded rationality of adversaries. While modeling the bounded rationality of
an agent, it is essential to consider the effect of results of past actions on the future actions of
an adversary. Kar et al. [73] (2015) proposed a new model called the Stochastic Human behavior
model for Attractiveness and Probability weighting (SHARP). SHARP considers the results
of an adversary’s past actions when modeling future actions. Furthermore, it addresses the shortcomings
of the probability weighting function in existing human behavior models. The authors used the
Subjective Utility Quantal Response (SUQR) model. SUQR uses a subjective utility
function, which is a linear combination of key features essential for an adversary to make
decisions. The SHARP model is tested on a wildlife dataset where attackers are poachers, and
the defender wants to reduce poaching activities and catch poachers by patrolling in different
wildlife regions. Researchers have also demonstrated the coordination between poachers in
[74] (2016). The authors formulated a collusive game in [74] between the set of poachers who
want to coordinate with each other against the defender. They performed human subject
experiments on Amazon Mechanical Turk and demonstrated the interaction between the
players.
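The SUQR idea of a linear subjective utility feeding a quantal (softmax) response can be sketched as follows. The three features used here (defender coverage, attacker reward, attacker penalty) follow the usual SUQR form, but the weights and target values are hypothetical and are not fitted to any dataset.

```python
import math


def suqr_attack_probabilities(coverage, reward, penalty, weights=(-8.0, 0.4, 0.1)):
    """Attack probability per target under a SUQR-style model.

    The subjective utility of a target is a weighted sum of its coverage
    probability, the attacker's reward, and the attacker's penalty. The attacker
    picks a target with probability proportional to exp(subjective utility).
    """
    w_cov, w_rew, w_pen = weights
    utilities = [
        w_cov * x + w_rew * r + w_pen * p
        for x, r, p in zip(coverage, reward, penalty)
    ]
    shift = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(u - shift) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical regions with equal rewards and penalties but different patrol coverage.
probs = suqr_attack_probabilities(
    coverage=[0.2, 0.6], reward=[5.0, 5.0], penalty=[-2.0, -2.0]
)
print([round(p, 3) for p in probs])
```

Unlike a perfectly rational attacker, this model still assigns some probability to the better-patrolled region, which is what makes it a bounded-rationality model.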
The authors in [109] (2017) provided a game-theoretic framework for resilient and
distributed generation control for renewable energies in microgrids. Moreover, some authors
proposed a distributed non-linear robust controller in [75][76] (2017) to improve the transient
stability of synchronous generators against excessive communication delay and cyber-physical
disturbances. A real-time cascading failures prevention multi-agent system is proposed in [77].
The paper deploys a multi-agent system to prevent cascading failures without performing load
shedding. In [78], the authors proposed an IDS for the neighborhood area network in AMI,
which is based on signature-based and anomaly-based methods. IDS detects different types of
attacks on smart meters and AMI in general, such as signal jamming, node compromise,
resource exhaustion, data injection, etc. The question arises: what if an adversary can
compromise a smart meter collector or cyber node (such as DRAS) in the grid? Through DRAS,
an adversary sends legitimate commands to destabilize the grid. In contrast to [78][79], our
motive is to prevent a compromised cyber system from sending legitimate commands that would
put the grid in an unsafe state and start cascading failures, rather than to act after a cascading
failure has already started.
Ryutov et al. [79] (2015) presented a server that provides a security mechanism to
monitor and control load according to security policies, both during normal operations and in the
presence of load-altering attacks. The results of the study state that it is critical to determine
whether to authorize DR/AMI commands when a system is approaching a critical state and,
in response, whether to take actions to oppose this change in the system. As we mentioned,
the motive of an adversary is not always to destabilize the grid; instead, she may aim to increase
the operational cost of the grid. For example, she sends malicious DR disconnect commands to a
certain number of homes in an area. The policy server authorizes these commands because the system
remains in a safe state after implementing them, yet they increase the operational cost in terms of
customer dissatisfaction. The limitation of this approach is that it does not consider
malicious commands that do not physically put the grid in an unsafe state. Moreover, the
paper does not discuss the more general case where the power utility is compromised and sends
malicious commands, the categories of attacks the server prevents, or the factors that need to be
considered when implementing such a system.
Tables 2.1 and 2.2 summarize the related work on risk assessment and resource allocation,
respectively. For each work, we list the system, the main approach, the focus, and the
cyber-physical attacks or systematic approach used to study that focus.
Table 2.1: Summary of the risk assessment related work approach, focus, and the cyber-physical
attacks or systematic approach to understand what is defined in focus.
Risk Assessment
First Author | System | Approach | Focus | Cyber-physical Attacks/Approach
Stamp [38] | Smart grid | Quantitative | Impact | Cyber attack to control generators
Cardenas [39] | Chemical Plant | Quantitative | Impact | Data integrity attack
Liu [40] | Smart grid | Quantitative | Impact | Coordinated switching attack to control circuit breakers to control load
Srivastav [42] | Smart grid | Quantitative | Impact | Graph Theoretic Approach
Anas [43] | Smart grid | Quantitative | Resilience | Load Drop Attack
Amini [22] | Smart grid | Quantitative | Resilience | Dynamic Load Altering Attack
Tan [15] | Smart grid | Quantitative | Impact | False Data Injection Attack
Pan [44] | Smart grid | Quantitative | Impact | Combined data integrity and availability attack
Lu [45] | Smart grid | Quantitative | Risk (no likelihood) | Approach to model the reliability of information from IT
Panteli [46] | Smart grid | Qualitative | Resilience | During extreme weather conditions
Alohali [47] | Smart grid | Quantitative | Impact | Packet replay attack over AMI
Dabrowski [48] | Smart grid | Quantitative | Resilience | Manipulating demand using CPUs, GPUs, etc.
Soltan [49] | Smart grid | Quantitative | Resilience | Manipulating power demand using high wattage IoT devices
Rinaldi [50] | Cross Domain | Qualitative | Analysis | Understanding interdependencies among cross infrastructure
Manshadi [52] | Cross Domain | Qualitative | Minimizing Cost | Optimization problem to take efficient actions against disruptions
Laprie [56] | Cross Domain | Qualitative | Analysis | Modeling interdependencies between power and information domain
Wu [57] | Cross Domain | Quantitative | Resilience | Graph Theoretic Approach between oil pipeline and grid
Xiang [58] | Smart grid | Quantitative | Impact | Probabilistic model to analyze attacks on circuit breakers
Li [62] | Android | Quantitative | Risk (only likelihood) | Bayesian Network
Hosseini [63] | Waterway Inland | Qualitative | Risk (only likelihood) | Bayesian Network
Poolsappasit [64] | Information Technology | Quantitative | Risk (only likelihood) | Bayesian Network
Table 2.2: Summary of the resource allocation related work approach, focus, and the
systematic approach to understand what is defined in focus.
Resource Allocation to Reduce Risk
First Author | System | Approach | Focus and Approach
Hushang [65] | Smart grid | Quantitative | Reducing risk via anti-jamming signals; game theory approach to model DDoS attack
Shelar [66] | Smart grid | Quantitative | Minimize impact of manipulating set point of inverter; Stackelberg Security Game approach
Srikantha [67] | Smart grid | Quantitative | Demonstrate stealthy strategies to disrupt transient stability via DER and how to prevent such attacks; differential game approach
Yan [68] | Smart grid | Quantitative | Model sequential topology attacks in the grid; Q-learning based approach
Nguyen [69]; Yang [70]; Pita [71] | LAX | Quantitative | Compute optimal defender strategy to patrol LAX security check points; Bayesian Stackelberg Security Game
Kar [73]; Gholami [74] | Wildlife Domain | Quantitative | Compute optimal defender strategy to patrol wildlife area to catch poachers; Subjective Utility Quantal Response; collusive game between various attackers
Ayar [75]; Ayar [76] | Smart grid | Quantitative | Improve the transient stability of synchronous generators against excessive communication delay and cyber-physical disturbances; non-linear robust controller
Beigi [78] | Smart grid | Quantitative | Intrusion detection system
Ryutov [79] | Smart grid | Qualitative | Policy server to prevent attacks on demand response functionality
2.3 Related Work Analysis
In this section, we analyze the related work discussed above. Almost every work we
investigated included some evaluation of CPAs on a specific system or system component and
discussed how to allocate resources efficiently in response to adversarial actions. We focus on
the limitations of previous efforts and on how we can fill those gaps through our research.
Risk Assessment. Researchers have focused on analyzing the impact of CPAs on one
of the functions of the grid [3][21][43][44]. Today’s attackers, however, use multiple attack
vectors, as in the Ukraine power grid attack [5]. We need to incorporate attacks on various direct
and indirect functions and components of the grid. How to evaluate the
grid’s resilience in the presence of multiple simultaneous CPAs is mostly unaddressed by the
current literature. Our primary contribution is to address the question: “How can we evaluate
the resilience of a given Smart Grid system in the presence of multiple simultaneous
cyber-physical attacks on its distinct functional components?” In this dissertation, we focus on
attacks that manipulate power supply (such as attacks on a peaker power plant, DERs, or a gas
distribution pipeline supplying gas to a peaker power plant), whereas most of the work in the
literature focuses on manipulating load.
Our contributions are twofold. First, we evaluate the resilience of the grid in the
presence of multiple simultaneous CPAs. We show that it is essential to consider multiple
attacks on the grid while performing resilience analysis and how it benefits power engineers to
understand the system dynamics in the presence of ongoing attacks. Second, we consider the
attack scenarios on the power system, natural gas distribution pipeline, and communication
network. We consider: 1) worm propagation attack to compromise smart meters that control
DERs remotely and manipulate the generation; 2) the pressure integrity attack on the natural
gas distribution pipeline and 3) DDoS attack on the pipeline communication network. We use
the output of these attack scenarios as input to the Power World Simulator and see how such
attacks destabilize the underlying power system. Furthermore, we present security metrics that
should be considered by system engineers to build more robust and resilient power systems.
The analysis of multiple attack scenarios on the grid helps system engineers to develop more
resilient systems and improves the response of the system to ongoing attacks. We discuss this
contribution in Chapter 3.
The related work also describes various methods to assess the security of gas
pipeline systems [51][53][55][57], but none of them describes how to evaluate the resilience of
such a system in the presence of attacks, even though the pipeline is an essential functional
component of the grid when it supplies fuel to peaker plants. We show how to use the
function-based methodology [24] to perform resilience evaluation of a gas pipeline system [29][30]. We
discuss this contribution in Chapter 3.
The approaches discussed in section 2.1 demonstrate how the physical power system is
affected by changes in one of the functions of the grid due to cyber attacks, but
they fail to capture the dynamic behavior of attacks, which depends on the vulnerabilities
of the system components under attack. The related work does not discuss the two
factors needed to compute risk: the likelihood of an attack and its impact.
First, most of the work in the literature focused on analyzing the impact of CPAs on the
grid by assuming that attacks have already happened or succeeded, as in
[22][38][40][43][49][60][61]. How attacks happen, what the probability of
compromise is under realistic scenarios, and how the likelihood of an attack affects the
estimation of risk have not been considered in those efforts. On the other hand, another group
of researchers built certain parts of the system and exploited a specific vulnerability to
demonstrate a cyber-physical attack [15][39][44][48]. The main problem with this approach is
that it is not repeatable and addresses only a few vulnerabilities.
Second, the impact of an attack was studied by assuming that a component fails (e.g.,
a generator or a power line [38]) and then measuring specific system attributes, such as
system frequency, to analyze the impact [22][40][48][49][58][61].
In this dissertation, we demonstrate an end-to-end risk assessment
methodology to compute the risk of a CPA. Our research differs from current approaches
in that we compute risk by combining the likelihood of an attack with its impact. To
compute the likelihood of an attack, we present a tool [31] to model the cyber domain of the
grid. We recommend our tool to power engineers not only for monitoring the vulnerability status
of the grid’s cyber functions and network components, but also for computing the probability of
compromise of each function from which CPAs originate. Furthermore, the tool assists them
in performing an in-depth study of any function of the grid, from the functional level down to the
vulnerability level. We use the Common Vulnerability Scoring System (CVSS) [80] to compute the
probability of compromise of different vulnerabilities. Then, we use those probabilities in a
Bayesian Network to compute the probability of compromising the ECC, which controls circuit
breakers.
In this analysis, we focus on known vulnerabilities and compute their probability of
exploitability. We do not consider unknown or zero-day vulnerabilities, so the probability of
attack represents the minimum likelihood of compromising a function. By considering known
vulnerabilities, we document and manage risk that is generally known and exploited by
attackers in the early stages of attacks. Finally, using contingency simulations in the Power World
Simulator, we analyze the impact of manipulating circuit breakers on the underlying power system.
The probability of compromise and the quantified impact are used to estimate the risk of
a cyber-physical attack on the grid. What distinguishes our analysis from the current literature
is that it is a repeatable methodology that can be used to assess the risk of cyber-physical
attacks on the smart grid and other CPSs. We discuss this contribution in Chapter 4.
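To make the likelihood-times-impact idea concrete, the following is a minimal sketch under simplifying assumptions; it is not the tool in [31]. It assumes that a vulnerability's exploit probability is its CVSS base score divided by 10, that a node is compromised if at least one of its vulnerabilities is exploited (a noisy-OR), and that the attacker must compromise every node along a single chain to reach the ECC. The scores, the attack path, and the impact value are hypothetical.

```python
def exploit_probability(cvss_base_score):
    """Map a CVSS base score (0-10) to a probability in [0, 1] (assumed mapping)."""
    return cvss_base_score / 10.0

def node_compromise_probability(cvss_scores):
    """Noisy-OR: the node falls if at least one of its vulnerabilities is exploited."""
    p_safe = 1.0
    for score in cvss_scores:
        p_safe *= 1.0 - exploit_probability(score)
    return 1.0 - p_safe

def chain_compromise_probability(path):
    """Probability of compromising every node along an attack path.

    Each node is reachable only once its predecessor is compromised, so in
    this chain-shaped Bayesian network the joint probability is the product
    of the per-node conditional probabilities.
    """
    p = 1.0
    for cvss_scores in path:
        p *= node_compromise_probability(cvss_scores)
    return p

# Hypothetical attack path: corporate workstation -> historian -> ECC,
# each listed with the CVSS base scores of its known vulnerabilities.
attack_path = [[7.5, 5.0], [9.8], [6.5, 4.3]]
p_ecc = chain_compromise_probability(attack_path)

# Hypothetical impact, e.g. MW of load lost in a contingency simulation.
impact = 120.0
risk = p_ecc * impact
```

A realistic network has branching paths and shared parents, which a general Bayesian network inference step would handle; the chain is used here only to keep the arithmetic visible.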
Resource Allocation. Very few works [75][76][77][78][79] have focused on designing a system
that prevents attacks on the grid and maintains its resilience. For instance, suppose an adversary
compromises either the Demand Response Automation Server (DRAS) or a cyber system in the power
utility (which controls demand response functionality). She sends malicious commands to
perform load curtailment or load shedding in a zip code to manipulate the power demand and
supply, thus destabilizing the grid. The question is how to prevent such
malicious commands from going through. An adversary may also be boundedly rational and, instead
of destabilizing the grid, increase the cost to engineers in terms of customer dissatisfaction
by issuing remote disconnect commands to customers. Thus, there is a need to design an
intelligent system that evaluates whether commands issued by a cyber system, which may or
may not be compromised, leave the grid in a safe state, and allows or blocks them
accordingly.
In this dissertation, we present an Intelligent Governor for Cyber-Physical Systems
(IGNORE) [35] to limit the success of attacks when a cyber system has been compromised and
leveraged by an adversary to mount attacks on the physical system. We describe the
methodology to build an IGNORE system for any cyber-physical function of a CPS and present
its usefulness in the smart grid infrastructure by developing a Governor for the grid’s Demand
Response functionality. An essential characteristic of IGNORE is that it evaluates the
requirement and safety properties of the commands issued by a cyber system (which may or may
not be compromised) and allows commands based on this evaluation. We are among the first
to design an intelligent system to prevent malicious power supply and load altering attacks
(from a compromised DR system) from propagating to the physical domain. Our crucial contribution is
to understand how to design such a system for the grid’s DR functionality and where to place
the governor, to identify the factors that power engineers should consider while developing
this system, and to show its effectiveness through empirical results. We discuss this contribution
in Chapter 5.
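The requirement and safety checks described above can be pictured with a small sketch. The code below is only an illustration of the idea and not the IGNORE implementation in [35]: the state model (total generation and load in MW), the thresholds, and all names are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class GridState:
    generation_mw: float
    load_mw: float

@dataclass
class DRCommand:
    # Negative values curtail load, positive values restore load.
    load_change_mw: float

def imbalance_after(state, command):
    """Supply/demand gap (MW) if the command were executed."""
    return abs(state.generation_mw - (state.load_mw + command.load_change_mw))

def is_required(state, command, tolerance_mw=5.0):
    """Requirement check: there is a real imbalance and the command reduces it."""
    gap_before = abs(state.generation_mw - state.load_mw)
    return gap_before > tolerance_mw and imbalance_after(state, command) < gap_before

def is_safe(state, command, max_imbalance_mw=20.0):
    """Safety check: the grid stays within an assumed imbalance bound."""
    return imbalance_after(state, command) <= max_imbalance_mw

def governor_allows(state, command):
    """Allow a command only if it is both required and safe."""
    return is_required(state, command) and is_safe(state, command)

# A curtailment during a genuine shortfall is allowed ...
shortfall = GridState(generation_mw=100.0, load_mw=130.0)
allowed = governor_allows(shortfall, DRCommand(load_change_mw=-25.0))

# ... while a disconnect issued to a balanced grid is blocked, even though it
# would leave the grid physically safe (the customer-dissatisfaction case).
balanced = GridState(generation_mw=100.0, load_mw=100.0)
blocked = not governor_allows(balanced, DRCommand(load_change_mw=-10.0))
```

Checking the requirement separately from safety is what distinguishes this sketch from a pure safe-state policy check: a command can be safe and still unnecessary.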
Some researchers have focused on protecting infrastructure and other domains from
physical attackers, including airports [69][70][71][72] and wildlife [73][74]. Researchers have
not yet considered oil stealing from pipelines. The question is: what happens when cyber
and physical attackers launch a coordinated attack against an oil pipeline system? Such an attack
is more dangerous and occurs in the real world [7][8][9][10][11]. There is a need to develop a
methodology that lets oil pipeline engineers make decisions in real time to protect the cyber and
physical targets from cyber and physical attackers by allocating resources prudently.
In this dissertation, we formulate the oil-stealing game between attackers (both cyber
and physical) and a defender [37]. Both types of attackers receive separate utilities for
coordination, which depend on each attacker’s ability and knowledge to attack and on the
coordination agreement between them. We are motivated by the formulation of the Bayesian SSG in [71].
We formulate the Stackelberg game as a mixed-integer quadratic program (MIQP) to solve the
problem of efficient resource allocation in the cyber and physical domains of oil pipeline
cyber-physical systems. The solution to this game helps the oil pipeline system administrator
allocate cyber security controls to the cyber targets and assign patrol teams to the pipeline
regions efficiently. This work provides a theoretical framework for formulating and solving the
above problem. We discuss this contribution in Chapter 6.
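The leader-follower structure of such a game can be illustrated with a toy example. The sketch below is not the MIQP formulation of this dissertation: it assumes a single attacker, one divisible defender resource, two targets (one cyber, one physical), and a simple grid search over coverage instead of an optimization solver. All payoff values are hypothetical.

```python
def attacker_utility(coverage, reward, penalty):
    """Expected attacker payoff for a target covered with probability `coverage`."""
    return (1.0 - coverage) * reward + coverage * penalty

def defender_utility(coverage, reward, penalty):
    """Expected defender payoff if the target covered with `coverage` is attacked."""
    return coverage * reward + (1.0 - coverage) * penalty

def solve_two_target_game(targets, steps=100):
    """Grid-search the defender's coverage split over two targets.

    The defender commits first; the attacker observes the coverage and attacks
    the target with the highest expected payoff (ties broken in the defender's
    favor, as in a strong Stackelberg equilibrium).
    Returns (defender value, [coverage_0, coverage_1]).
    """
    best = None
    for k in range(steps + 1):
        cov = [k / steps, 1.0 - k / steps]
        att = [attacker_utility(cov[i], t["att_reward"], t["att_penalty"])
               for i, t in enumerate(targets)]
        top = max(att)
        best_responses = [i for i in range(2) if abs(att[i] - top) < 1e-9]
        value = max(defender_utility(cov[i], targets[i]["def_reward"],
                                     targets[i]["def_penalty"])
                    for i in best_responses)
        if best is None or value > best[0]:
            best = (value, cov)
    return best

# Hypothetical payoffs for a cyber target and a pipeline region.
targets = [
    {"att_reward": 10.0, "att_penalty": -10.0, "def_reward": 5.0, "def_penalty": -5.0},
    {"att_reward": 6.0, "att_penalty": -4.0, "def_reward": 3.0, "def_penalty": -8.0},
]
value, coverage = solve_two_target_game(targets)
```

Enumerating coverage on a grid only scales to tiny instances, which is one reason security games of realistic size are posed as mathematical programs.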
Most research efforts are dedicated to individual components of the grid in the
presence of CPAs, such as AMI, DR, and AGC. Risk assessment of the power grid
that focuses on energy storage is essential but mostly unaddressed by the current
literature. Our work also concentrates on what actions the defender should take to meet power
demand at minimum operating cost in the presence of CPAs on the grid’s power and
information infrastructure. We discuss the formulation of the Power Storage Protection (PSP)
framework against a fixed opponent (adversary). We fix the strategy of the adversary and
model the problem as a POMDP from the perspective of the defender (power utility). Using
this approach, the defender computes what actions to perform in order to recover (stabilize the
grid) from CPAs at minimum operating cost. We provide a theoretical framework for
formulating the above problem and provide experimental results to support our claim using a
simplified PSP scenario. We discuss this contribution in Chapter 7.
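A central step in any POMDP is the belief update, in which the defender revises its estimate of the hidden state after each observation. The sketch below shows that step for a two-state example. The states, observations, and probabilities are hypothetical, and the transition model is given for a single defender action only; it is not the PSP model of Chapter 7.

```python
STATES = ("normal", "under_attack")

def belief_update(belief, transition, observation_model, observation):
    """Bayes filter step: b'(s') is proportional to O(o | s') * sum_s T(s' | s) * b(s).

    belief:            dict state -> probability
    transition:        transition[s][s_next] for the action just taken
    observation_model: observation_model[s_next][observation]
    """
    unnormalized = {}
    for s_next in STATES:
        predicted = sum(transition[s][s_next] * belief[s] for s in STATES)
        unnormalized[s_next] = observation_model[s_next][observation] * predicted
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical model: with a fixed adversary, an attack starts with probability 0.1
# per step and persists while the defender only monitors.
transition = {
    "normal": {"normal": 0.9, "under_attack": 0.1},
    "under_attack": {"normal": 0.0, "under_attack": 1.0},
}
# Hypothetical sensor: low storage output is far more likely during an attack.
observation_model = {
    "normal": {"storage_ok": 0.9, "storage_low": 0.1},
    "under_attack": {"storage_ok": 0.2, "storage_low": 0.8},
}

prior = {"normal": 1.0, "under_attack": 0.0}
posterior = belief_update(prior, transition, observation_model, "storage_low")
```

The defender's policy then maps this belief, rather than the unobservable true state, to a recovery action.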
Chapter 3
Smart Grid Resilience: Attack-Defense Approach
In this chapter, we present an Attack-Defense approach to evaluate smart grid resilience
in the presence of three CPAs on its distinct functional components. Our contributions are
twofold. First, we evaluate the resilience of the grid in the presence of three CPAs. We show
it is essential to consider multiple attacks (which target systems that provide stability to the
grid) on the grid while performing resilience analysis and how it benefits power engineers to
understand the system dynamics in the presence of ongoing attacks. Second, we demonstrate
attacks on the power system, natural gas distribution pipeline, and communication network.
We consider: 1) worm propagation attack to compromise smart meters that control DERs
remotely and manipulate the generation; 2) the pressure integrity attack on the natural gas
distribution pipeline to prevent peaker power plant from generating power and 3) DDoS attack
on the gas pipeline communication network. Through contingency simulations in the Network
Simulator (NS2) and Power World Simulator, we analyze these attacks that propagate from the
cyber domain to the power system and discuss how such attacks destabilize the underlying
power grid. Finally, we present security metrics that should be considered to build more robust
and resilient power systems.
3.1 Objective
The main objective of this chapter is to address this question: “How can we evaluate
the resilience of a given smart grid in the presence of multiple cyber-physical attacks on its
distinct functional components?”
Figure 3.1: Attack-Defense Approach.
Figure 3.1 describes the attack-defense scenario. First, an adversary performs a worm
propagation attack that compromises smart meters to control DERs remotely. In response to
this attack, power engineers (the defender) instruct the Gas-Fired Peaker Plant (GFPP) to produce
power to meet the unsatisfied demand. Second, in response to the defender’s actions, the adversary
performs a pressure integrity attack on a segment of the natural gas transmission pipeline that
delivers gas to the GFPP. Power engineers then send control signals to remote terminal units
and the compressor station to maintain the gas pressure. Finally, the adversary performs a DDoS
attack on the pipeline communication network to slow the response of the system. We
combine these simulations with the Power World simulation and examine step by step
how such attacks destabilize the underlying power system. In the following sections, first, we
discuss the attack scenario corresponding to each function, how resilience is affected if those
functions are compromised, and the simulation setup. Second, we present the analysis
methodology. Finally, we conclude this chapter by discussing the results of the simulation and
how they are useful to power engineers.
3.2 Worm Propagation Attack
Attack Modeling. Smart meters are installed at customers’ premises and are
responsible for two-way communication with the power utility. They are accessible over the
Wireless Mesh Network (WMN). Energy storage devices, such as DERs, are attached to
smart meters. The power utility sends control signals to smart meters to dispatch power from
DERs or to operate as an independent operator. Since smart meters are connected to the WMN,
they are accessible over the network and exposed to CPAs. We simulate a worm propagation
attack on the AMI network, which consists of smart meters. In this attack scenario, we assume
that an adversary has access to one of the houses in a particular area and has infected the meter of
that house. The worm installed on the smart meter then seeks to infect other meters in
the neighboring area and executes the payload it carries. We assume that the payload
prevents meters from dispatching energy stored in DERs. Since the power utility is unable to
dispatch the stored energy, it is not able to meet the power demand. When generation cannot
meet demand, the power line frequency in the area decreases and may fall below the
under-frequency protection threshold. The adversary performs this attack during peak demand hours,
leveraging information from the Independent System Operator (ISO) [81], to
aggravate the impact of the attack on the system. In response to this attack, power engineers
instruct the GFPP to produce power. To counter the defender’s actions, the adversary performs a
pressure integrity attack on the gas pipeline system to prevent natural gas from reaching peaker
plants.
Simulation Setup for Worm Propagation. We model the WMN using the Network
Simulator (NS2) as described in [17]. There are 100 meters, each representing a residential
house, with a gateway placed at the center of the simulation area (see Figure 3.2). Each
meter is configured with a transmission rate of 1 Mbps, and Ad-hoc On-demand Distance
Vector (AODV) is used as the routing protocol [82]. The meters send their readings to the
gateway server. We infect one meter with a worm of size 1 KB, and the transmission range
among meters is 100 m. These two factors determine the speed at which the
worm propagates through the network: if the worm size or the distance between two meters is
small, the worm propagates quickly.
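The effect of worm size and transmission range on propagation speed can be approximated without a network simulator. The sketch below is a deliberately simplified model and not the NS2 setup used here: it assumes that a worm copy takes size divided by link rate to cross one hop, that any two meters within range can communicate, and that contention, routing overhead, and retransmissions are ignored. The meter positions are hypothetical, and 1 KB is taken as 1000 bytes.

```python
import math
from collections import deque

def hop_transfer_time(worm_bytes, rate_bps):
    """Seconds needed to send one copy of the worm over a single link."""
    return worm_bytes * 8 / rate_bps

def infection_times(positions, start, tx_range_m, hop_time_s):
    """Breadth-first spread of the worm from meter `start`.

    Returns a dict mapping each reachable meter index to the earliest time
    (in seconds) at which it is infected. Unreachable meters are omitted.
    """
    times = {start: 0.0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, pos in enumerate(positions):
            if v not in times and math.dist(positions[u], pos) <= tx_range_m:
                times[v] = times[u] + hop_time_s
                queue.append(v)
    return times

# Values from the setup above: 1 KB worm, 1 Mbps links, 100 m range.
hop_time = hop_transfer_time(worm_bytes=1000, rate_bps=1_000_000)

# Hypothetical meter coordinates in metres; the last meter is out of range.
meters = [(0.0, 0.0), (80.0, 0.0), (160.0, 0.0), (400.0, 0.0)]
times = infection_times(meters, start=0, tx_range_m=100.0, hop_time_s=hop_time)
```

In this model a larger worm raises the per-hop time and a shorter range raises the hop count, so the infection time of a distant meter grows with both.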
Figure 3.2: Demonstration of the distribution of meters in the simulated area [17].
Figure 3.3: Oil and Gas Cyber-Physical System Attack Graph.
3.3 Attack on Gas Pipeline Systems
In this section, we use the function-based methodology to evaluate the resilience of gas
pipeline systems under a pressure integrity attack. We combine the results of simulations of a
wireless mesh network for remote terminal units and of a gas pipeline to measure
the Time to Criticality (TTC) parameter, that is, the time for an event to reach the failure state.
For the attack-defense scenario, we use the results of this simulation.
Attack Graph. Figure 3.3 represents the attack graph showing how cyber attacks
propagate from the cyber to the physical domain and their consequences. The attack graph is
divided into four sections: cyber attacks, physical impact, effects, and results. The cyber attack
column represents compromised targets such as headends, smart meters, and compressor stations.
Once a cyber system is compromised, through either malicious or non-malicious means, it
has an impact on the physical domain. The physical impact column represents the functions
that are affected by cyber attacks, such as loss of oil and gas transmission and distribution
control. The disturbances in the physical functions have consequences for the system’s
functionality, which result in equipment damage or delay in gas delivery.
Consider a cyber-attack on the sucker rod pump system which affects oil generation in
terms of the loss of crude oil production. The loss of production, in the long run, leads to
dissipation of the oil storage capacity and ultimately unsatisfied customer demands.
Alternatively, consider an attack on compressor stations which affects the gas distribution
pipeline. Once such a pipeline system is affected, gas is not delivered to processing plants,
industries and, finally, customers. By understanding this attack graph, the system
administrator can identify possible attack scenarios on different functions of the oil and gas
systems.
Figure 3.4: Attack Tree based on function-based methodology.
3.3.1 Function-based Methodology
The attack tree in Figure 3.4 represents the idea of the function-based methodology
[24]. The motivation behind this approach is to narrow the focus to a particular function of the
system. The function can be affected in malicious or non-malicious ways, which are abstracted
by this approach. We consider an important function of the gas pipeline system: gas delivery.
We build an attack tree to see how gas delivery is affected by CPAs. The first level of the attack
tree represents the primary function of gas delivery, which attackers want to affect. The second
level represents the impact on the physical system as a change in gas pressure or failure of gas
pipeline nodes. The third level represents the cyber attacks that affect the physical system on
the second level. Finally, the fourth level represents how cyber attacks are performed. The
fourth level of the attack tree represents the cyber attack column in Figure 3.3.
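The four-level structure described above maps naturally onto a nested data structure. The sketch below encodes one possible tree and enumerates its root-to-leaf attack paths. The node labels are illustrative, drawn loosely from the scenarios in this chapter, and are not a transcription of Figure 3.4.

```python
# Level 1: function -> Level 2: physical impact -> Level 3: cyber attack
# -> Level 4: how the cyber attack is performed (list of leaves).
attack_tree = {
    "Disrupt gas delivery": {
        "Change in gas pressure": {
            "Pressure integrity attack": [
                "Compromise remote terminal units",
                "Compromise compressor station",
            ],
        },
        "Failure of gas pipeline nodes": {
            "DDoS attack": [
                "Compromise wireless nodes",
            ],
        },
    }
}

def attack_paths(tree, prefix=()):
    """Yield every root-to-leaf path of the tree as a tuple of labels."""
    if isinstance(tree, list):
        for leaf in tree:
            yield prefix + (leaf,)
    else:
        for name, subtree in tree.items():
            yield from attack_paths(subtree, prefix + (name,))

paths = list(attack_paths(attack_tree))
```

Each path reads from the function under attack down to the concrete cyber action, which is the direction in which the function-based methodology narrows the analysis.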
The attack scenario we consider is the combination of a pressure integrity attack and a DDoS
attack to affect the high-pressure natural gas pipeline system. An adversary compromises a
system in the cyber domain, from which she reprograms remote terminal units to show
misleading system readings to the SCADA system and instructs a compressor station to
change the pressure of the natural gas flowing through a pipeline. Simultaneously, she performs
a DDoS attack on the communication network to slow the response of the system.
3.3.2 Pressure Integrity Attack
The sudden change in the pressure of gas flowing through a pipeline can affect the
internal pipeline coating, and if the pressure goes beyond the Maximum Allowable Operating
Pressure (MAOP) [83], pipeline segments are closed via Remotely Controlled Valves (RCVs)
[84] or may rupture. In this attack scenario, we describe how an adversary affects gas delivery
by maliciously increasing the gas pressure in a pipeline. Similar to the Ukraine power
grid attack [5], this attack scenario demonstrates multiple capabilities of the attackers, such as
the compromise of remote terminal units and SCADA system components and a DDoS attack, with
the motive to affect the pipeline infrastructure and gas delivery.
Attack Scenario. In the pressure integrity attack, an adversary performs two types of
cyber attacks on a pipeline infrastructure. First, she compromises a percentage of the RTUs
(including RCVs) present along a pipeline segment and the compressor station to reduce the
situational awareness of the pipeline segment maintained by the SCADA system. She then
instructs a compressor station that the pressure at which the natural gas should be delivered has
increased. When most of the RTUs along a pipeline segment are compromised, the SCADA
system cannot determine the actual pressure at which the compressor station is pumping natural
gas into that pipeline segment. The compressor station increases the pressure so that the natural
gas can move through the pipeline and meet the delivery pressure at the sink node. Every pipeline
has an MAOP [83] below which gas usually flows without affecting the pipeline. The SCADA system
determines that something is wrong with a particular pipeline segment only after some delay.
The compressor station adjacent to the compromised compressor station detects the change in
the pressure of the gas delivered and notifies the SCADA system. To reduce the pressure of
natural gas in the pipeline segment and to increase situational awareness, the SCADA system
starts sending control signals to the RTUs and the compressor station. Once remote terminal units
are compromised, the attacker performs a DDoS attack (which we describe in section 3.4) on the
wireless mesh network to slow the response of the SCADA system. The questions to answer
at this stage are:
1. What percentage of RTUs is compromised, and for how long?
2. By how much does the pressure increase between the two end points for a given
percentage of compromised RTUs?
3. What is the Time to Criticality, that is, the time within which the SCADA system must react?
We need to model multiple systems in this attack scenario. Therefore, we have used
separate simulations for attacks on RTUs, network communication, and pipeline system.
Finally, we combine the results of the wireless network and gas pipeline simulation to measure
the TTC parameter.
Gas Pipeline Model. We use the Pipeflow software to model the behavior of a pipeline
and to form a concept model for simulation (see Figure 3.5). The purpose of the pipeline is to
deliver natural gas to the sink, located 160 miles from the source. The compressor
station is positioned 96 miles from the source and 64 miles from the sink to maintain
consistent gas pressure. The diameter of the pipeline is 15.2 inches. The delivery pressure
at the sink should be 900 psig. The MAOP of the pipeline is 1200 psig (which
is the MAOP of a high-pressure transmission pipeline [83]). These are the pressure values in the
absence of a cyber attack. If the pressure rises above the MAOP, many unwanted scenarios may
arise, such as damage to the internal pipeline coating, immediate valve
closure, leakage, or an explosion of the pipeline. The compressor station should pressurize the gas
to 1196.61 psig to deliver gas at 900 psig at the sink node. The compressor station can change the
pressure according to the delivery requirements. The difference between pumping and delivery
pressure is due to energy lost to friction as gas flows through the pipeline. The
software provides the details of the pipeline model but does not allow us to simulate attack
scenarios. Therefore, we manually consider three rates at which the pressure increases: 1)
instantly, 2) linearly, and 3) logarithmically.
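For the three pressure-rise profiles, the time until the pressure reaches the MAOP can be written in closed form. The sketch below covers only the pipeline side of the TTC and uses assumed profile shapes: a step to a target pressure, a linear ramp p(t) = p0 + r t, and a logarithmic rise p(t) = p0 + k ln(1 + t). The rate r, the scale k, and the time unit (minutes) are hypothetical; the starting pressure and MAOP are the values given above.

```python
import math

MAOP_PSIG = 1200.0    # maximum allowable operating pressure from the model above
START_PSIG = 1196.61  # compressor station discharge pressure from the model above

def ttc_instant(target_psig):
    """Step change: the MAOP is crossed immediately if the target exceeds it."""
    return 0.0 if target_psig > MAOP_PSIG else math.inf

def ttc_linear(rate_psig_per_min):
    """Linear ramp p(t) = START + r * t, solved for p(t) = MAOP."""
    return (MAOP_PSIG - START_PSIG) / rate_psig_per_min

def ttc_logarithmic(scale_psig):
    """Logarithmic rise p(t) = START + k * ln(1 + t), solved for p(t) = MAOP."""
    return math.exp((MAOP_PSIG - START_PSIG) / scale_psig) - 1.0

# Hypothetical attack parameters.
t_instant = ttc_instant(target_psig=1250.0)
t_linear = ttc_linear(rate_psig_per_min=1.0)
t_log = ttc_logarithmic(scale_psig=2.0)
```

Because the normal operating pressure is only a few psig below the MAOP, even slow profiles leave little time before the threshold is crossed, which is why the response delay of the SCADA system matters.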
Simulation Setup. We model a WMN (a sensor network as in [85]) using the Network
Simulator (NS) between the compromised cyber system node and the remote terminal units, and
between the SCADA system and the remote terminal units. The remote terminal units along the
pipeline are capable of communicating with the SCADA system over the WMN. Each RTU contains a
meter that captures physical properties and sends signals to the SCADA system. We model
each remote terminal unit as a meter in the WMN. The configuration [86] of meters operating
over a radio network is:
1) Radio Frequency: 900 MHz
2) Data Rate: 10 MB
3) Transmitter Output: 30 dBm
4) Receiver Sensitivity: -97 dBm
Figure 3.5: Gas Pipeline Model and Attack Scenario.
Figure 3.6: DDoS attack on gas pipeline system to reduce the response of the SCADA.
The shadowing propagation model [87] is used to simulate outdoor communication
because it predicts the mean received power and computes its variation at a certain distance.
The configuration of the shadowing model is 1) Path Loss Exponent: 2.7, 2) Standard
deviation: 4 and 3) Reference Distance: 4.0 m. We model 300 nodes distributed uniformly
at random in the region of the pipeline and the CS. There are 170 remote terminal units, a
wireless router, a compromised node, a SCADA node, and the wireless network nodes for
communication. The wireless router is responsible for routing messages over the internet to the
SCADA system. UDP/IP is used as the transport layer protocol and Ad-hoc On-demand Distance
Vector (AODV) as the wireless routing protocol to simulate the WMN. The compromised cyber
system (in Figure 3.5) is a node in the WMN that is used to control the functionality of the
remote terminal units in the physical domain. We use the results of this simulation in the
attack-defense scenario.
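The log-normal shadowing model referred to above can be sketched in a few lines. The code uses the path loss exponent, reference distance, transmit power, frequency, and receiver sensitivity listed in this section, and makes three assumptions that are not stated in the text: the standard deviation is in dB, antenna gains are unity, and the power at the reference distance follows free-space loss.

```python
import math
import random

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def free_space_loss_db(distance_m, freq_hz):
    """Free-space path loss in dB (assumes unity antenna gains)."""
    return 20.0 * math.log10(4.0 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)

def mean_received_dbm(distance_m, tx_dbm=30.0, freq_hz=900e6, ref_m=4.0, exponent=2.7):
    """Mean received power: reference power minus 10 * exponent * log10(d / d0)."""
    ref_dbm = tx_dbm - free_space_loss_db(ref_m, freq_hz)
    return ref_dbm - 10.0 * exponent * math.log10(distance_m / ref_m)

def received_dbm(distance_m, sigma_db=4.0, rng=random):
    """One random draw: mean power plus zero-mean Gaussian shadowing (in dB)."""
    return mean_received_dbm(distance_m) + rng.gauss(0.0, sigma_db)

def link_up(distance_m, sensitivity_dbm=-97.0, rng=random):
    """A packet is receivable if the drawn power meets the receiver sensitivity."""
    return received_dbm(distance_m, rng=rng) >= sensitivity_dbm

# Example: mean power and one random link check at 100 m with a seeded generator.
rng = random.Random(1)
mean_100m = mean_received_dbm(100.0)
ok_100m = link_up(100.0, rng=rng)
```

The random term is what makes link quality vary between nodes at the same distance, which a deterministic path loss model would not capture.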
3.4 Denial of Service Attack
In this attack scenario, the attacker’s motive is to prevent the SCADA control
commands from reaching remote terminal units of the pipeline segment (see Figure 3.6). The
attackers perform a DDoS attack on the WMN by compromising wireless nodes and sending
56
packets to jam the network. Since the network is crowded, legitimate control signals are not
able to reach remote terminal units and pressure crosses MAOP. This attack reduces the
response of the SCADA system and thereby completing the attack-defense scenario proposed
in Figure 3.1. Similarly, an adversary can perform an attack on the WMN of the smart grid and
manipulate the state variables maintained by the SCADA system. We have not included this
case in this study. The simulation setup for this attack is the same as the one described for the
pressure integrity attack scenario in sub-section 3.3.2.
3.5 Analysis Methodology and Simulation Results
In this section, first, we describe the simulation setup of the grid and then the analysis
methodology. Finally, we discuss the results of the simulation.
Simulation Setup for Smart Grid. Power World Simulator is used to model the grid
components and perform the simulation analysis. The power system is simulated using the IEEE
9-bus system (see Figure 3.7). We model the three generators of the IEEE 9-bus model as
follows: Bus 1 Gen (B1G) serves as the base load plant, Bus 2 Gen (B2G) as the peaker power
plant, and Bus 3 Gen (B3G) as the DER PV.
Figure 3.7: IEEE 9-bus power model.
B3G is modeled as a DER and configured using WECC Solar Photo Voltaic dynamic
model specification [91]. We use frequency as a metric to determine whether the system shuts
down. Each generator is assigned frequency limits and a pickup time. The nominal
frequency is 60 Hz; over-frequency protection is enabled with a threshold of 62.4 Hz and
under-frequency protection with a threshold of 57.6 Hz, each with a pickup time of 2 seconds.
The generators will
trip due to over frequency protection mechanism if frequency exceeds 62.4 Hz for more than
2 seconds as described in [24].
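The protection rule described above can be checked against a sampled frequency trace. The sketch below is illustrative only: it assumes the trace is a list of (time, frequency) samples sorted by time and treats "more than 2 seconds" as a strict inequality. The function name and the trace format are hypothetical and not part of the simulator.

```python
OVER_FREQ_HZ = 62.4      # over-frequency threshold from the text
UNDER_FREQ_HZ = 57.6     # under-frequency threshold from the text
PICKUP_TIME_S = 2.0      # pickup time from the text


def first_trip_time(samples):
    """Return the first time at which protection would trip, or None.

    samples: list of (time_s, frequency_hz) tuples sorted by time.
    A trip occurs once the frequency has stayed beyond the same threshold
    for more than PICKUP_TIME_S seconds without returning to the normal band.
    """
    violation_start = None
    violation_kind = None
    for time_s, freq_hz in samples:
        if freq_hz > OVER_FREQ_HZ:
            kind = "over"
        elif freq_hz < UNDER_FREQ_HZ:
            kind = "under"
        else:
            kind = None

        if kind is None:
            # Back inside the normal band: reset the pickup timer.
            violation_start = None
            violation_kind = None
            continue

        if kind != violation_kind:
            # A new excursion starts at this sample.
            violation_start = time_s
            violation_kind = kind

        if time_s - violation_start > PICKUP_TIME_S:
            return time_s
    return None
```

A short dip below 57.6 Hz that recovers within the pickup time returns `None`, whereas a sustained excursion returns the time of the first sample past the pickup window.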
Analysis Methodology. The following points describe the analysis methodology.
Steps 1 to 4 represent the simulation steps of worm propagation. Step 5 represents the response
of the SCADA system. Steps 6 to 11 describe the attack scenario when attackers attack the gas
pipeline system with a motive to prevent gas from reaching GFPP. Step 12 describes the
response of the SCADA. Finally, steps 13 and 14 present the DDoS attack to reduce the
response of the SCADA.
1. Create the background traffic in the AMI WMN of smart meters installed at the
customer premises in a particular zip code. Each command is assumed to be 1000 bytes
in size. We start the attack 150 secs after the simulation starts so that the WMN has
stabilized under the background traffic.
2. An adversary who controls the compromised node in the cyber system installs a
worm on a smart meter. The worm then propagates over the AMI network and
compromises the other smart meters in the area. The worm is programmed
to disconnect the DERs at customers' premises from the cyber network.
3. We capture the time at which smart meters are compromised by the worm and
disconnect DERs from the network.
4. When DERs at various homes are disconnected, there is a power generation loss. In
order to simulate the DER disconnect attack, we assume that the power supplied by the
DERs comes from the B3G generator. Therefore, to demonstrate this loss of power
generation, we shut down B3G at 400 seconds in the PowerWorld
Simulator.
5. In response to this attack, the system engineers ask GFPP to generate more power
to cover the loss because of the previous attack. The peaker power plant starts
generating more power, and the system recovers from the worm propagation attack.
To demonstrate this action, we increase the exciter set point of B2G by 10% at 401
secs during the simulation, just after the B3G shutdown. Delaying this response would
only make the situation worse.
6. In order to disrupt the power generation of B2G, an adversary conducts a pressure
integrity attack on the natural gas pipeline to prevent the gas from reaching GFPP.
We start the simulation by creating the background traffic where remote terminal
units are sending data about the status of a pipeline to the nearest compressor station
and the SCADA system. Each command is assumed to have a size of 500 bytes.
7. The adversary controlling the compromised node generates a series of commands
targeting each remote terminal unit along the pipeline segment and the compressor
station. The time interval between consecutive commands is drawn from a uniform
distribution over (0, T) for different values of T.
8. We capture the time at which commands are received by each meter. The attacker
sends re-program commands to remote terminal units so that they show misleading
pressure readings to the compressor station and SCADA, and instructs the compressor
station to increase the pressure of the natural gas.
9. Once the attacker compromises RTUs, the incorrect information about the pipeline
status is delivered to the SCADA system, and the compressor station starts
increasing the pressure to meet the pressure delivery rate at the sink. The attack is
performed on a particular pipeline segment. If the pressure goes beyond MAOP, the
pipeline segment may burst, and leakage would happen. This stops the gas delivery
and causes significant loss of power generation when required.
10. To demonstrate this attack on the power system, we shut down the B2G at 500 secs
in the Power World Simulator.
11. Since natural gas is flowing through the pipeline, it is difficult for the attacker to hide
the pressure increase from the SCADA for long. The SCADA system will discover
that the pressure has been deliberately increased in a particular pipe segment.
12. In order to reduce pressure, the SCADA system sends signals to remote terminal units
and the compressor station to report more information about the pipeline segment and
to reduce the gas pressure.
13. The attacker now performs a DDoS attack by compromising particular nodes in the
WMN. Each node in the WMN sends data at a time chosen from the uniform
distribution (0, T), for different values of T.
14. We capture the cumulative number of remote terminal units that received commands
while varying the number of nodes compromised during the DDoS attack. Finally, we
analyze the results of the simulation and describe the metric to consider for
developing robust systems.
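Steps 7 and 13 both draw timing from a uniform distribution over (0, T). A minimal sketch of how such a command schedule could be generated is shown below; the function name, the start time, and the use of Python's `random` module are assumptions for illustration and do not reflect the network simulator's own traffic generator.

```python
import random


def command_send_times(num_commands, max_interval_s, start_time_s, rng):
    """Return cumulative send times for a series of commands.

    The gap between consecutive commands is drawn from Uniform(0, T),
    where T is max_interval_s, as described in step 7.
    """
    send_times = []
    current_time = start_time_s
    for _ in range(num_commands):
        current_time += rng.uniform(0.0, max_interval_s)
        send_times.append(current_time)
    return send_times


# Example: one command per remote terminal unit (170 in the model),
# with T = 30 ms and a hypothetical attack start time of 150 s.
rng = random.Random(42)
schedule = command_send_times(170, 0.030, 150.0, rng)
```

Repeating the schedule for several values of T gives the family of runs that the methodology compares.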
Simulation Analysis. Figure 3.8 represents the generation and load requirements when
there are no attacks on the system. The blue line stands for generation and red line for the load.
The generation is slightly less than load because of the loss of power during transmission over
the buses. Figure 3.9 shows the area frequency of the IEEE 9-bus system. During the
simulation, frequency does not go beyond its protection thresholds. Figure 3.10 represents the
simulation result of the worm propagation attack. The cumulative number of meters infected
over a period is shown in the graph. The speed at which meters are infected depends on the
size of the worm, which we have not evaluated in this study. Once meters are infected, DERs
are disconnected from the power system. This leads to a loss of generation. Figure 3.11
represents the loss of generation due to the worm propagation attack. At 400 secs, the generation
decreases, and the power system shuts down due to the under-frequency protection mechanism
(Figure 3.12).
Figure 3.8: Generation and Load: No Attack.
Figure 3.9: Area Frequency: No Attack.
Figure 3.10: Worm Propagation over WMN. (Time in seconds.)
Figure 3.11: Generation and Load: Worm Propagation Attack
This happens because, when generation is lost for some time, the area frequency
decreases and crosses the under-frequency protection limit. The
generators do not have enough time to produce power to meet the load, causing the frequency to
drop below the threshold. The under-frequency protection mechanism causes generators to trip
to avoid equipment damage. To prevent the area frequency from crossing the threshold, the
system administrator increases the power generation of B2G.
Figure 3.13 shows a small increase in generation after 400 secs, but the system is still
destabilized and collapses because the frequency crosses the threshold due to the sudden change
in power demand and supply. The system takes time to stabilize and bring frequency under
control. We assume that the system will eventually stabilize and our motive is to show the
effects of the attack on the system. In practice, a system may take considerable time to stabilize,
and attackers may initiate the pressure integrity attack on the gas pipeline to prevent recovery.
Figure 3.12: Area Frequency after Worm Propagation attack.
Figure 3.13: System Admin response to Worm Propagation Attack.
Figure 3.14: Remote Terminal Units along gas distribution pipeline compromise.
Figure 3.15: Pressure Integrity Attack on gas distribution pipeline.
Figure 3.14 shows the cumulative number of remote terminal units that received the
compromised commands, and Figure 3.15 shows how the pressure of the gas is affected (if it
goes beyond MAOP, gas delivery stops due to pipe closure or explosion), which prevents gas
delivery to B2G. Time to Criticality (TTC) is an important parameter to consider; it is the time
within which a system must respond if the response is to prevent the collapse of the grid. Since gas
is not delivered to the GFPP, generation is impacted. To demonstrate this attack, we shut down
power generation due to B2G in the Power World Simulator. Figure 3.16 shows the loss of
power in the system; the frequency crosses the under-frequency protection limit, causing the
system to shut down (Figure 3.17). To counter this attack, system engineers send control commands
to the RTUs and compressor station along the pipeline. The attacker performs a DDoS attack
to reduce the response of the system so that the system remains in an unstable state.
Figure 3.16: Generation and Load after pressure integrity attack on pipeline.
Figure 3.17: Area Frequency after pressure integrity attack on pipeline.
Figure 3.18 represents how Packet Delivery Ratio (PDR) in the WMN is affected by a
DDoS attack when some nodes (f) are compromised to perform DDoS. Figure 3.19 shows the
reduced response of the SCADA system when some percentage (X) of wireless nodes are
compromised to perform DDoS. Such attacks affect not only the WMN but also other network
functions such as cybersecurity and system maintenance,
which further reduces the response of the system and its ability to detect attacks.
Figure 3.18: DDoS on WMN of gas distribution pipeline.
Figure 3.19: Reduced SCADA Response to DDoS attack (when time interval to send
commands is 30ms).
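The two quantities plotted in Figures 3.18 and 3.19 can be computed from simple counters. The sketch below uses the usual definition of PDR as packets received divided by packets sent; the helper names and the default of 170 remote terminal units (the count used in the model) are illustrative choices, not the simulator's API.

```python
def packet_delivery_ratio(packets_sent, packets_received):
    """Packet Delivery Ratio: fraction of sent packets that were received."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    if not 0 <= packets_received <= packets_sent:
        raise ValueError("packets_received must lie between 0 and packets_sent")
    return packets_received / packets_sent


def rtu_reach_fraction(rtus_reached, total_rtus=170):
    """Fraction of remote terminal units that received SCADA commands."""
    if total_rtus <= 0:
        raise ValueError("total_rtus must be positive")
    return rtus_reached / total_rtus
```

Evaluating both functions for each fraction of compromised nodes yields one point per run on curves of the kind shown in the two figures.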
3.6 Usefulness to Power Engineers
In this chapter, we demonstrate attack scenarios that show the ability of an adversary
to perform a wide variety of CPAs and to broaden the surface area of attack by considering
alternative grid functions that provide power generation during contingencies. Through
network, gas pipeline and power system simulations, we showed that attackers could cause
partial or total system shutdown by performing manipulation of power generation and load
requirements and causing the frequency to cross over or under protection thresholds. Such
attacks are not contained to one area but propagate to neighboring regions, causing cascading
failures through overloading and tripping.
The results of the attack-defense study show that the grid's resilience is reduced when
functions and components that provide stability to the grid are compromised. Metrics such as
TTC, PDR, percentage of smart meters compromised and how they are compromised should
be considered while developing complex systems. Furthermore, it is crucial for system
engineers to evaluate the resilience of the grid by considering multiple functions and
components (that are responsible for the stability of the grid) together in a risk analysis. The
analysis of attack scenarios on the grid infrastructure helps engineers develop robust and
resilient systems, improve situational awareness, and improve the response of the system to
ongoing attacks. Instead of using generators to increase or reduce generation, one option is to
use demand response as a spinning reserve [43]. It is essential to maintain at least minimal state
awareness so that appropriate actions are taken before TTC. This approach assists power
engineers in securing backup options so that the primary function is delivered even in the
presence of attacks. We present the application of this approach in the oil & gas pipeline system
in section 8.3.
Chapter 4
End-to-End Risk Assessment
In this chapter, we present the end-to-end risk assessment methodology to compute the risk
of a cyber-physical attack. We demonstrate the effectiveness of this approach by computing
the risk of manipulating circuit breakers to control different generators with different switching
times. Circuit breakers are used to protect the system from high fault currents automatically.
They are controlled by relays that either close or open those circuit breakers. The decision to
open or close a circuit breaker is made by a relay based on processing specific measurements.
If an attacker manipulates the configuration and settings of a relay, then the grid becomes
susceptible to cyber-physical threats. In order to quantify the risk of a cyber-physical attack,
two elements have to be computed: the probability and the impact of an attack. Therefore, the
contribution of this chapter can be summarized in the following points:
1. A methodology to estimate the probability of a cyber-physical attack using Bayesian
Network. The focus in this dissertation is on attack paths through which an adversary
exploits the Energy Control Center (ECC) to manipulate circuit breakers.
2. The impact of attacking ECC on the power system is evaluated using the IEEE 39-bus
model. The transient stability module of Power World is used to simulate circuit breaker
manipulation. The rotor angle of the synchronous generators and frequency of the
system are used to quantify the impact of the attack.
3. The risk of the cyber-physical attack is determined by combining the probability and
impact of the attack using a quantitative approach.
Figure 4.1: Smart Grid Cyber domain.
Figure 4.2: Test Network.
4.1 Risk Assessment Methodology
Risk is a function of impact and likelihood of cyber-physical attacks. The likelihood of
cyber-physical attacks is based on the vulnerabilities of the network components of the cyber
system that controls the power system and the ability of the attacker to compromise
vulnerabilities. The vulnerabilities of the cyber system are discovered by performing a
vulnerability assessment of the system components. The ability of an attacker to compromise
varies with his/her skill level. Attackers may be script kiddies, criminals, or
government-supported actors [92]. The impact of attacks is estimated by simulating the
attacks on the power system using the Power World simulator. We assume that an adversary
wants to compromise the cyber system that controls a specific functionality of the grid, namely
the ECC, in order to manipulate circuit breakers.
The steps of defining risk can be summarized as follows:
1. Define the cyber system that controls a specific functionality of the power system (such
as circuit breaker).
2. Perform vulnerability assessment of the cyber components and compute the probability
of compromise of the components of interest.
3. Simulate attacks on the power system using the Power World simulator.
4. Estimate risk on the power system using the probability of compromise and impact of
attacks.
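Step 4 combines the probability of compromise with the impact of the attack. The text states only that risk is a function of the two; the sketch below assumes the common product form (risk = probability × impact) purely for illustration. The function names and the scenario values in the usage example are hypothetical.

```python
def risk_score(probability_of_compromise, impact):
    """Risk as probability times impact (an assumed combination rule)."""
    if not 0.0 <= probability_of_compromise <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if impact < 0.0:
        raise ValueError("impact must be non-negative")
    return probability_of_compromise * impact


def rank_scenarios(scenarios):
    """Sort (name, probability, impact) tuples by descending risk."""
    return sorted(
        scenarios,
        key=lambda scenario: risk_score(scenario[1], scenario[2]),
        reverse=True,
    )


# Hypothetical attack scenarios with made-up probabilities and impact scores.
example = [("open breaker A", 0.2, 8.0), ("open breaker B", 0.5, 2.0)]
ranked = rank_scenarios(example)
```

Ranking scenarios this way shows how a less likely attack with a large physical impact can outrank a more likely one with a small impact.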
4.2 Smart Grid Cyber Domain
In this section, we provide a brief description of the smart grid architecture and then,
we discuss the test network used for experiments.
4.2.1 Cyber Domain Architecture
Figure 4.1 provides an overview of the proposed cyber domain of the grid and covers
the minimum system requirements such as security management, network
deployment, and policy implementation. It also identifies the critical domain areas, functions,
and their weaknesses. Thus, it is vital to understand the cyber domain so that the system
engineers can establish and implement security policies effectively. We refer to the guidelines
of the NIST Smart Grid cyber domain [93] to develop the test network. The derived network
is a good approximation of the real-world system because it is based on the NIST framework
and roadmap for Smart Grid Interoperability Standards, Release 1.0 (NIST SP 1108), which is
published by the U.S. Department of Commerce. The same framework is used to design real-world
power grids and to develop standards on how to protect these systems from cyber attacks. The
standard provides general cyber domains and their inter-connection in the grid. We considered
the network components described in [93] to build the test network in Figures 4.1 and 4.2.
Starting from the left-hand side, building automation system, smart appliances, and
electric vehicles are the endpoints that consume power from the grid. The building automation
system manages the power consumption of smart buildings and interacts with the central
control system, which controls other parts of the power network; smart appliances are
power-consuming devices in homes that connect to smartphones, desktops, or laptops and
provide more control and information remotely. All such components send readings about the
use of power by each appliance and electricity quality delivered to endpoints to AMI meters.
Further, AMI meters send the collected data to the field collection system, vendor head end,
and billing system. At these places, data is processed and stored in relevant databases. The data
will then be sent to Meter Data Management (MDM) system where the data is used for various
functions such as power prediction, customer load profile management, power outages, power
quality at different places, and billing customers. The MDM connects to the Customer Information
System (CIS), which maintains the customer database. All the power quality readings are
transferred to the SCADA system, which monitors the overall power grid functionality.
The ECC further manages power generation, distribution, and delivery. Using the data stored in
the CIS and MDM, the Outage Management System (OMS) predicts power outages and sends
information to the control center to manage power generation. Demand Response (DR)
functionality is controlled through DR proxy. Many independent organizations are connected
to the DR proxy. DR proxy is further connected to DR Data Repository (DRDR). DRDR
connects to DR Decision Support (DDS) and DR Automation Server (DRAS) that decides
whether to perform DR functions in a given organization. DRAS interacts with the OMS to
check whether DR is required in response to reduced power generation. Understanding this
architecture shows us the various power grid functions and components and how they are
connected. Through the ECC, an adversary can control the circuit breakers that connect
generators to the grid.
4.2.2 Test Cyber Domain Architecture
To perform experiments, we consider a test network in Figure 4.2, which is a part of
the smart grid cyber domain. The system consists of smart appliances and electric vehicles
attached to the grid. The power consumption and electricity quality readings from these
components are collected by the field collection systems, vendor-specific heads, and billing
system. Further, this information is communicated to MDM, and via MDM, it is transferred to
OMS to keep track of reserved power and power outages that might occur in the near future.
The OMS also manages how much power should be reserved to meet the demand during
contingencies. OMS interacts with the ECC to make decisions regarding power generation, DR,
power delivery, etc.
4.3 Likelihood of Attack
In this section, first, we provide the background which includes a brief description of
the Bayesian Network (BN) and then, we describe how it is applied to the smart grid resilience
modeling. Second, we discuss whether BN is efficient for the smart grid cyber domain. Third,
we present the BAGS tool to compute the probability of compromise, and fourth, we discuss
how to compute the probability of compromise of various vulnerabilities and network
components. Finally, we analyze the results via simulations.
4.3.1 Background
In this sub-section, we provide a brief description of the Bayesian Network (BN) and
then, we describe how it is applied to compute the overall probability of compromising a cyber
function.
Bayesian Belief Network (BBN). It is a probabilistic graphical model based on Bayes’
theorem. It represents the conditional dependency between a set of random variables in the
form of a Directed Acyclic Graph (DAG) $G = (V, E)$. $V = \{V_1, V_2, \ldots, V_N\}$ is a set of nodes or
variables of the system, and $E$ is the set of edges representing the dependencies among
variables. A link $E_{i,j}$ from $V_i$ to $V_j$ represents the causal dependency between these two nodes.
Here, $V_j$ depends on the value of $V_i$, and $V_i$ is called a parent (pa) of $V_j$. The relationship between
variables of the BBN is measured using the Conditional Probability Distribution (CPD). The
joint probability distribution of N variables is:
$$P(V_1, \ldots, V_N) = \prod_{i=1}^{N} P(V_i \mid V_{i+1}, \ldots, V_N) \qquad (1)$$
$$P(V_1, \ldots, V_N) = \prod_{i=1}^{N} P(V_i \mid \mathrm{Parents}(V_i)) \qquad (2)$$
To calculate the joint probability distribution, the individual distributions and the
conditional distributions between parents and children must be predetermined. Such
distributions are estimated from data analysis, expert knowledge, a combination of both, or
simulation. A significant advantage of using a BBN is the ability to compute posterior
probabilities of an event when certain events are observed in real time. This is called belief
propagation. For instance, the likelihood of the smart grid to be in the resilient state is updated
when certain disruptive events are observed on some of its components.
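The factorization in equation (2) and the posterior update described above can be illustrated on a very small network. The sketch below uses a hypothetical three-node chain A -> B -> C with made-up conditional probabilities, where True means "compromised", and computes probabilities by brute-force enumeration. It is meant only to show the mechanics; it is not the inference method used in this work and does not scale to large networks.

```python
from itertools import product

# Hypothetical network structure: each node maps to a tuple of its parents.
PARENTS = {"A": (), "B": ("A",), "C": ("B",)}

# CPT[node][parent_values] = P(node = True | parents = parent_values).
# All numbers are illustrative assumptions.
CPT = {
    "A": {(): 0.3},
    "B": {(True,): 0.8, (False,): 0.1},
    "C": {(True,): 0.9, (False,): 0.05},
}


def joint_probability(assignment):
    """P(V_1, ..., V_N) as the product of P(V_i | Parents(V_i))."""
    probability = 1.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[parent] for parent in parents)
        p_true = CPT[node][parent_values]
        probability *= p_true if assignment[node] else 1.0 - p_true
    return probability


def probability(query, evidence=None):
    """P(query | evidence) by summing the joint over all assignments."""
    evidence = evidence or {}
    nodes = list(PARENTS)
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint_probability(assignment)
        denominator += p
        if all(assignment[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator
```

For example, `probability({"C": True})` gives the unconditional probability that C is compromised, while `probability({"A": True}, {"C": True})` gives the updated belief about A once a compromise of C has been observed.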
Figure 4.3: Smart Grid Resilience Modeling.
BBN Applied to Resilience Modeling. The probability of system resilience [95] is
expressed in terms of the probability of reliability and the restorative capacity of the system. The
likelihood of restoration depends on the likelihood of reliability and system characteristics
when disruptive events are observed. The probability of reliability depends on the robustness
and adaptiveness of the system. The disruptive events can happen on different components of
the smart grid. It depends on how system components are connected and how an event
propagates from one part to other parts of the system. The joint probability distribution of the
system, according to Figure 4.3, is defined as:
P (Resilience) = P (Disruptive Events) * P (Smart grid Components State | Disruptive
Events) * P (Robustness | Smart grid Components State) * P (Adaptiveness | Smart grid
Components State) * P (Reliability | Robustness, Adaptiveness) * P (Restoration | Reliability,
Smart grid Components State) * P (R | Reliability, Restoration)
4.3.2 Is Bayesian Network efficient for smart grid?
Bayesian Networks evaluate the causal relationships between random variables given
uncertainty about events. In the grid, the random variables are the cyber nodes and network
components, modeled as Bernoulli variables, and an event corresponds to node X being
compromised. For X = True, we get the probability of compromising node X and for X = False,
the probability of it not being compromised. Furthermore, a BN assists system administrators in
understanding the various paths an adversary may take to move forward into the smart grid
network and the likelihood of taking those paths. Since a BN incorporates the likelihood of
compromise of each node in the path, it is a natural fit. However, computing unconditional
probabilities on a BN is an NP-hard problem. The question arises: what if the size of the
network is in the hundreds of thousands of nodes?
For small-scale BNs, efficient exact (Variable Elimination or Junction
Tree) or approximate (Loopy Belief Propagation) inference techniques can be used to compute
unconditional probabilities [94]. Approximate inference is used specifically for large graphs.
According to [94], as we increase the size of the network, the time to compute unconditional
probabilities increases. We should first answer:
1. What is the approximate size of the smart grid cyber network that is possible?
2. Can we reduce the size of the network without affecting the results? If yes, how to
reduce the size of the network?
Size of the Smart grid Cyber Network. The smart grid cyber network is a combination
of many IoT devices, smart meters, and subnetworks that represent various functions (such as
the billing system, OMS, MDM, DR, etc.). It is divided into different domains [93]: Customer,
Grid Operations, Third Party, Generation, Transmission, and Distribution. Each domain has a
set of nodes for different functionalities, and each node represents a subnetwork that
implements that node. A city has more than a hundred thousand customers, and if we
consider all customers in the BN, the number of customer nodes alone will be very high
because, for each customer, we will include the smart meter, smart meter collector, etc. The smart
meter collector sends power consumption and quality readings to the smart grid operations
nodes, and finally, operations nodes are connected to the grid’s generation, transmission, and
distribution functions. Therefore, the size of the cyber network is given by:
Size = (Number of customers * Nodes per customers) + (Number of grid functions *
Average number of nodes per function) + (Third Party Nodes) + (Number of nodes in
generation function) + (Number of nodes in transmission function) + (Number of nodes in
distribution)
The factors that are responsible for increasing the number of nodes in the network are
customers and the number of nodes per grid function. For such an extensive network, we
have to use approximate inference techniques to compute unconditional probabilities.
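The size expression above can be evaluated directly to see which term dominates. The sketch below is a straightforward transcription of that expression; the function and parameter names are hypothetical, and every count other than the 100,000 customers with three nodes each (smart meter, collector, head end server, the figures the text uses later in this section) is a made-up placeholder.

```python
def cyber_network_size(
    num_customers,
    nodes_per_customer,
    num_grid_functions,
    avg_nodes_per_function,
    third_party_nodes,
    generation_nodes,
    transmission_nodes,
    distribution_nodes,
):
    """Total number of BN nodes according to the size expression in the text."""
    return (
        num_customers * nodes_per_customer
        + num_grid_functions * avg_nodes_per_function
        + third_party_nodes
        + generation_nodes
        + transmission_nodes
        + distribution_nodes
    )


# All customers modeled individually (non-customer counts are placeholders).
full_size = cyber_network_size(100_000, 3, 6, 5, 10, 20, 20, 20)

# A single representative customer instead of all customers.
reduced_size = cyber_network_size(1, 3, 6, 5, 10, 20, 20, 20)
```

With these placeholder values the customer term accounts for almost all of `full_size`, which is why collapsing customers into one representative node has such a large effect on the network size.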
Can we reduce the size of the network? A security domain is a boundary around
functional components such that, if one system inside it is compromised, all other components
within the boundary become accessible. For instance, consider the DR network functions in
Figure 4.4. If an adversary can compromise the DR proxy, she will be able to access all other
functions. If we
apply security domain concept in our BN, we should only consider the main components in a
function to compute the probability of compromise because those network components will
represent the overall probability of compromising the functional node. In this way, we reduce
the number of nodes to consider in computing the unconditional probabilities. So, in Figure
4.4, we should only consider the vulnerabilities and probability of compromise of the DR proxy
server. The system administrator and security engineers indicate which systems
are essential and which are not, based on the privileges of the network components. In the
simulation, we assume only one main network component and one vulnerability for each to
demonstrate how BN is used to compute the probability of compromise of ECC.
Figure 4.4: Demand Response Internal Functions.
Figure 4.5: BAGS System Design.
Furthermore, instead of modeling all customers in the BN, we consider only one
customer node. This is because all customers will have almost the same set of nodes,
implemented with the same software (of the same power utility). Thus, they will have the same
set of vulnerabilities and the same probability of compromise. Therefore, we use only one smart
meter, smart meter collector, and smart grid head end server corresponding to a customer node.
If we were to consider, say, 100,000 customers, the number of customer nodes would be
300,000. However, we have reduced it to only 3. This is justified because an attacker can enter
the grid network from any one of the customers' premises, and all customers have the same
probability of compromise (for the same utility company). In this way, the size of the network
is reduced drastically, and now we can use exact inference algorithms to compute
unconditional and posterior probabilities. Thus, BN can be used to model the smart grid
network. If the number of nodes in the network increases, we should use approximate inference
techniques to compute unconditional probabilities.
4.3.3 Bayesian Attack Graph for Smart Grid (BAGS)
In this sub-section, we present the Bayesian Attack Graph for Smart Grid (BAGS) tool
(Figure 4.5) to quantify the probability of compromise for each smart grid cyber domain.
BAGS takes functions, network architecture, applications, and a vulnerability report as input
and generates three BNs.
The top-level network is called Functional Bayesian Network (FBN) that defines how
smart grid functions are connected, their probability of failure and connection with the
resilience variables. The possibility of a function compromise is the joint probability
distribution of its network components that are based on the vulnerabilities of each component.
FBN can be expanded to the second level as Network Bayesian Network (NBN) that can be
further expanded to the third level as Vulnerability Bayesian Network (VBN). BAGS makes
it easy for system engineers to perform an in-depth study of any one function of the grid. The
system engineers can incorporate this functionality into their system, and they can see the
impact of any compromised component of the grid on its resilience. It also helps them to
identify the failure paths from one grid function to another in advance so that they can devise
appropriate security strategies and deploy resources effectively and efficiently.
Tool Input Variables.
1. Network and Smart Grid functions: It represents the list of higher level network
and smart grid functions such as energy management system, outage management
system, power generation, demand response, smart meters, billing function, and
vendor head end.
2. Network and SG Architecture: The network architecture supports all the
functions of the smart grid system. For instance, MDM consists of servers,
workstations, database, and communication network. It also represents the
interconnection between all these components.
3. Applications: The list of client-side and server-side applications is also given as
input to the system. It also includes vendor-side applications that integrate with the
smart grid architecture.
4. Vulnerability Report: To compute the likelihood of a particular system
compromise, we must have a list of vulnerabilities that exist in the system’s
components and applications. For instance, the billing server has Cross-Site
Scripting (XSS) and local file inclusion vulnerabilities, which can be exploited to gain
control over the billing server. We will use the Common Vulnerability Scoring
System (CVSS) [80] scores to compute the likelihood of compromise.
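The text does not spell out at this point how a CVSS score is turned into a likelihood. One simple convention, shown below purely as an assumption, is to divide the base score (which ranges from 0 to 10) by 10, and to combine several independent vulnerabilities on the same component with a noisy-OR. The function names are hypothetical, and the scores in the usage example are made up rather than taken from any vulnerability report.

```python
def cvss_to_probability(base_score):
    """Map a CVSS base score in [0, 10] to a probability in [0, 1].

    Assumed convention: probability = score / 10.
    """
    if not 0.0 <= base_score <= 10.0:
        raise ValueError("CVSS base score must lie in [0, 10]")
    return base_score / 10.0


def component_compromise_probability(base_scores):
    """Noisy-OR over vulnerabilities assumed to be independent.

    The component is compromised if at least one vulnerability is exploited.
    """
    probability_none_exploited = 1.0
    for score in base_scores:
        probability_none_exploited *= 1.0 - cvss_to_probability(score)
    return 1.0 - probability_none_exploited


# Hypothetical billing server with two vulnerabilities (scores are made up).
billing_server_probability = component_compromise_probability([5.0, 5.0])
```

Under this convention a component with more vulnerabilities, or with higher scores, receives a higher probability of compromise, which is the qualitative behavior the vulnerability report input is meant to capture.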
Once we have these inputs to the system, it generates three BNs at three different levels
of hierarchy. The output of the model is displayed on an interactive dashboard where system
engineers can select a particular function and can view its system architecture, components,
and information flow. Furthermore, they can select a particular component and view its
vulnerability report and likelihood of its compromise. These features enable system
engineers to perform real-time monitoring, predict the impact of a compromise on different
components of the system, and finally quantify the resilience (probability of compromise).
Function Bayesian Network (FBN). FBN represents the causal interconnection
between different functions of the grid. Nodes in the FBN describe the functions of the grid
and edges represent the information flow from one function to another. It also describes how
the impact of a system compromise travels across different functions. We use the test network
presented in Figure 4.2 to provide a better understanding of the FBN. Figure 4.6 depicts the
FBN of the test network. Smart meter function (S1) sends messages to the smart grid head
function (S2), which further sends messages to the billing system (S3). Similarly, electric
vehicle charging data (consumption and power quality) is captured by the vendor-specific
head end (S6).
The data set from the billing system, head end, and the vendor-specific end is provided
to the MDM (S8). MDM connects with the OMS (S4), which is further connected to the ECC
(S7). Finally, ECC is connected to the Resilience function (R). Suppose an adversary compromises
a smart meter by injecting malware. She then sends compromised messages about the power
consumption to the billing center and also exploits the vulnerability to gain control over the
component. If she gains control over the billing system, she escalates privileges by exploiting
vulnerabilities of other systems. The FBN helps system engineers understand how
information and control flow in the grid cyber domain. They can identify various attack
paths an adversary may take if a specific function is compromised. In our model, they can
choose any function in the FBN and view its network components.
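The attack-path idea described above can be sketched as a path enumeration over the FBN. This is a minimal illustration, not the dissertation's tool: the edge set is read off the prose description of the test network (including a direct S2 to S8 link for the head-end data provided to the MDM), and the names `FBN_EDGES` and `attack_paths` are hypothetical.

```python
from typing import Dict, List

# Edges of the test-network FBN as read from the prose description.
# The exact edge set is an assumption; Figure 4.6 is the authoritative source.
FBN_EDGES: Dict[str, List[str]] = {
    "S1": ["S2"],        # smart meter -> smart grid head end
    "S2": ["S3", "S8"],  # head end -> billing system, head end -> MDM
    "S3": ["S8"],        # billing system -> MDM
    "S6": ["S8"],        # vendor-specific head end -> MDM
    "S8": ["S4"],        # MDM -> OMS
    "S4": ["S7"],        # OMS -> ECC
    "S7": ["R"],         # ECC -> resilience function
    "R": [],
}


def attack_paths(graph: Dict[str, List[str]], source: str, target: str) -> List[List[str]]:
    """Enumerate every directed path from source to target using an iterative DFS."""
    paths: List[List[str]] = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # guard against cycles, although the FBN is acyclic
                stack.append((nxt, path + [nxt]))
    return sorted(paths)


# All paths an adversary could follow from a compromised smart meter function.
paths_from_meter = attack_paths(FBN_EDGES, "S1", "R")
for p in paths_from_meter:
    print(" -> ".join(p))
```

Listing the paths this way makes it easy to see which functions appear on every path and are therefore natural candidates for protection.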
Figure 4.6: Function Bayesian Network.
Figure 4.7: Network Bayesian Network.
Network Bayesian Network (NBN). NBN represents various network components
supporting a particular function. The system engineers can select a specific function at the FBN
level on the dashboard and can see its network components and analyze its functional status
and information flow. Figure 4.7 describes the NBN of the FBN in Figure 4.6. To illustrate
with an example, the smart meter functionality (S1) has components: home appliances, electric
vehicles, smart meters, and smart meter collector server. The communication between these
components is mostly wireless. Smart Meter collector server collects data from all the smart
meters defined in its zone and sends that information to the smart grid head end (S2) over the
wireless mesh network. Further, S2 consists of workstations, database servers, and smart grid
head server, which collects data from various S1 systems. The significance of such modeling
is to understand how data flows from one system to another, in other words, whether there is
any vulnerable path from one system to another. Such an interface provides
the ability to analyze every system component and its impact on the overall system. We have
not shown firewalls and routers in this network diagram. We have displayed components that
provide the functionality to the grid and have vulnerabilities in their implementation. However,
we can extend this model to show other network components.
Figure 4.8: Vulnerability Network.
Vulnerability Bayesian Network (VBN). Power engineers can select a particular
network component from the NBN and view the list of its vulnerabilities, taken from the vulnerability report
submitted as an input to the system. The VBN also provides information about the likelihood of
network component compromise based on the CVSS score. These probabilities are
combined to calculate the compromise probability of the network component and ultimately of
the function to which the network component belongs. Also, if there is a change in a
vulnerability, the system automatically updates the belief of the system compromise and
propagates it to other parts of the network.
Attack Graph: Figure 4.8 describes the vulnerability attack graph from a remote
attacker’s point of view according to our test case. A remote attacker performs a variety of
attacks to gain access to the grid functions. According to our test case, an attacker exploits
XSS, CSRF, or SQL Injection vulnerability to gain access to the Vendor-specific server or
performs remote code execution on smart meters or the smart meter collector. Once the attacker
has access to the smart meter collector, she exploits the OpenSSL Heartbleed vulnerability to gain access to the
smart grid head server, from where she targets the billing engine server. The SQL Injection
vulnerability can be exploited to gain remote access using usernames and passwords retrieved from the
database. Once she gains access to the billing engine, she performs a port scan over the
network range and identifies the MDM server.
Once the MDM server is identified, she exploits the buffer overflow vulnerability
present in the server operating system. Then, the attacker gains access to the MDM directly
without going through the billing engine. Once she gains access to the MDM server, she
can perform a variety of attacks, such as an integrity attack that changes meter readings, but in this
case, she is interested in gaining access to the energy control center. So she performs further
scanning and identifies the OMS server, which is connected to the MDM server. She exploits
the SSL v3 POODLE vulnerability and gets root access. Once she has access to the OMS
server, she identifies and attacks the ECC by exploiting the OpenSSL Heartbleed vulnerability.
Tool Output. The main purpose of this tool is to provide the following functionalities:
Measure Resilience: The primary goal is to measure the resilience of the system in
real time. By considering the vulnerabilities of the system components, the likelihood of their
compromise is calculated. Using the BN, we connect different system components and see how an
attack propagates from one system to another. The resilience is computed based on the
probability of ECC compromise, since the ECC controls the power generation and
distribution. If the probability of ECC compromise is high, the system is not resilient, and
power delivery will be affected in case of an attack. The probability of ECC compromise
serves as an indicator of the resilience of the system.
Alert Mechanism: The alert mechanism helps the system administrators to put
checkpoints on the probabilities of system component compromise. Based on their
knowledge of the system, they assign probability thresholds to each component described in
the test network. If the compromise probability of a system component crosses its threshold,
an alarm is raised. This enables system engineers to identify the most vulnerable components
so that they can assign appropriate security controls and perform vulnerability assessment and
penetration testing.
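The threshold check behind the alert mechanism can be illustrated with a few lines of Python. This is only a sketch: the function name `raise_alerts` and the threshold values are hypothetical, while the three probabilities are the MDM, OMS, and ECC values that the text reports for the test network.

```python
from typing import Dict, List


def raise_alerts(probabilities: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the components whose compromise probability exceeds their threshold.

    Components without an explicit threshold never trigger an alert (default 1.0).
    """
    return sorted(
        name for name, p in probabilities.items()
        if p > thresholds.get(name, 1.0)
    )


# Unconditional probabilities of compromise reported for the test network.
probabilities = {"MDM": 0.4786, "OMS": 0.1484, "ECC": 0.1157}

# Hypothetical thresholds chosen by an engineer; a stricter bound is placed
# on the ECC because it controls the circuit breakers.
thresholds = {"MDM": 0.50, "OMS": 0.20, "ECC": 0.10}

alerts = raise_alerts(probabilities, thresholds)
print("Components above threshold:", alerts)
```

In this illustration only the ECC crosses its (deliberately strict) threshold, which is the kind of signal that would prompt a vulnerability assessment of the components on its attack path.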
View System Architecture: The system engineers view the whole system on an
interactive dashboard. The interactive dashboard provides a view of 1) FBN where all the
system functions are logically connected, 2) NBN which is a detailed description of the
functional components and 3) VBN which describes the list of vulnerabilities associated with
the components and probabilities of their compromise (unconditional probabilities) and attack
graph. This enables system engineers to analyze the status of the grid remotely, and they can
perform impact analysis by simulating different attack scenarios on different components of
the system.
4.3.4 Probability to Compromise a Cyber Function
In this sub-section, we discuss how to quantify the likelihood of compromising a cyber
function in the smart grid cyber domain. Circuit breakers are controlled remotely by power
engineers through ECC. The likelihood of compromising ECC to control circuit breakers
depends on the vulnerabilities of the ECC’s network components. Some of the reasons that are
responsible for the presence of vulnerabilities are:
1. Misconfiguration of servers,
2. Unpatched version of the software,
3. Local and remote code execution,
4. Buffer overflow, and
5. Open ports, etc.
The description of these vulnerabilities is out of the scope of this dissertation. An
adversary performs a variety of scanning techniques (Network, Port, and Vulnerability) on the
ICS network to discover vulnerabilities and exploit them to gain control. In order to compute
the likelihood of compromising ECC, we compute the probability of compromise of each network
component that lies in the path to access ECC. We consider CVSS [80] to compute the
probability of compromise of each network component based on the set of vulnerabilities they
have. Once we know the vulnerabilities, we use the online CVSS calculator to compute the
probability of compromise for each vulnerability and thus for each network component. We
feed the probabilities into a Bayesian Network, which is built based on the information flow
from the point of Internet access to ECC.
CVSS consists of three scores: Base, Temporal, and Environmental [80]. The base
score captures the basic properties of the vulnerability, such as Attack Vector (AV), Attack
Complexity (AC), User Interaction (UI), Privileges Required (PR), and the
impact on Confidentiality, Integrity, and Availability. The temporal score addresses the
characteristics of the vulnerability that evolve over its lifetime, for instance, exploit code
maturity. Finally, the environmental score indicates characteristics that are dependent on the
implementation and environment of the organization. We compute the probability of exploiting
a vulnerability (v) from its base score, obtained with the online CVSS calculator.
The online CVSS calculator computes the base score of a vulnerability once we know
the values of its fundamental properties. For instance, for the SSL v3 POODLE vulnerability, the
qualitative values for each parameter are AV: Network, AC: High, PR: None, UI: Required,
Confidentiality: High, Integrity: Low, Availability: Low, and Scope: Unchanged. Those
qualitative values are mapped to quantitative values to get the base score, which is 3.1. Since
the score of every vulnerability is out of 10, we divide this base score by 10 to obtain the
probability of compromise for this vulnerability, 0.31. The security engineers maintain the
values of these parameters for each vulnerability in their database, and they should refer to
CVSS for updates. The method of computing the probability of compromise of each cyber
component is as follows:
1. Scan each network component (e.g., smart meter) using automated tools and
discover vulnerabilities.
2. For each network component:
a. Compute the probability of compromise using CVSS score for each
vulnerability.
b. Compute the probability of compromise of each component using CVSS
probabilities of each vulnerability.
c. Combine the probabilities of each component to compute the probability of
compromise for each functionality.
3. Feed probabilities in a Bayesian network. It will automatically compute the
unconditional probabilities of each component using Bayes rule.
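The steps above can be sketched in Python. The division of the base score by 10 follows the text; the concrete AND and OR combination formulas are an assumption of this sketch (the source does not spell them out) and treat exploits as independent events. The function names are hypothetical.

```python
from math import prod
from typing import Iterable


def cvss_to_probability(base_score: float) -> float:
    """Map a CVSS base score (0 to 10) to a probability of compromise."""
    if not 0.0 <= base_score <= 10.0:
        raise ValueError("CVSS base score must lie between 0 and 10")
    return base_score / 10.0


def or_rule(probs: Iterable[float]) -> float:
    """Any single exploit suffices: 1 minus the probability that all fail.

    Assumes the individual exploits succeed independently (sketch assumption).
    """
    return 1.0 - prod(1.0 - p for p in probs)


def and_rule(probs: Iterable[float]) -> float:
    """Every exploit is required: product of the individual probabilities.

    Assumes the individual exploits succeed independently (sketch assumption).
    """
    return prod(probs)


# Step 2a: per-vulnerability probability, e.g. the POODLE base score of 3.1.
p_poodle = cvss_to_probability(3.1)

# Step 2b: per-component probability for the vendor-specific server, using the
# Table 4.2 values for CSRF, XSS, and SQL injection.
vendor_vulns = [0.88, 0.61, 0.72]
p_vendor_any = or_rule(vendor_vulns)   # attacker needs only one of them
p_vendor_all = and_rule(vendor_vulns)  # attacker needs to chain all of them

print(round(p_poodle, 2), round(p_vendor_any, 4), round(p_vendor_all, 4))
```

The gap between the OR and AND results shows why the choice of rule per component matters: a component exposing several independent entry points is far more likely to fall than one that requires a full exploit chain.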
Each node represents a cyber domain function and it consists of various network
components that implement that functionality. Each network component has some
vulnerabilities which can be exploited by an adversary. We combine these
vulnerabilities to compute the probability of compromise for a single function, depending on
whether every component must be compromised or any one suffices, and whether every vulnerability
must be exploited or any one suffices (AND or OR rule). We create a Local Conditional
Probability Table (LCPT) for a network component which depends on the parents of that
network component. For instance, in figure 4, MDM has three parents: smart meter collector
(SMC), vendor-specific server (VSS), and billing engine (BE). Three different functions may
connect to MDM at different network components, which may have different vulnerabilities to
exploit. If an adversary has access to any one of the three functions, she can also exploit the MDM
component’s vulnerabilities. So, the LCPT contains three columns, one for each parent. The
probability of compromising MDM now depends on the parent from which the attacker is trying
to exploit a specific MDM function vulnerability. We use the OR rule to compute P(MDM | SMC,
VSS, BE), P(MDM | SMC, VSS’, BE’), etc. based on the vulnerabilities of the MDM
components. When these values are inserted in the BN and solved, it gives the unconditional
probabilities, since it takes into account the probability of exploiting the parents as well.
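A small sketch may help to make the LCPT concrete. It assumes a simplified OR rule in which the child can be attacked with probability `p_child` as soon as at least one parent is compromised, and it marginalizes over the parents as if they were independent; a full Bayesian network solver would account for their shared ancestors. All numeric values and the helper names are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def build_lcpt(n_parents: int, p_child: float) -> Dict[Tuple[bool, ...], float]:
    """P(child = T | parent states) under a simplified OR rule.

    The child is exposed as soon as any parent is compromised; with no
    compromised parent the attacker has no foothold and the entry is 0.
    """
    return {
        states: (p_child if any(states) else 0.0)
        for states in product([True, False], repeat=n_parents)
    }


def marginal(lcpt: Dict[Tuple[bool, ...], float], parent_probs: Sequence[float]) -> float:
    """Unconditional P(child = T), summing over all parent configurations.

    Treats the parents as independent, which is a simplification of this sketch.
    """
    total = 0.0
    for states, p_cond in lcpt.items():
        weight = 1.0
        for compromised, p in zip(states, parent_probs):
            weight *= p if compromised else 1.0 - p
        total += weight * p_cond
    return total


# Hypothetical numbers: MDM compromise probability given a foothold, and the
# compromise probabilities of its parents SMC, VSS, and BE.
lcpt_mdm = build_lcpt(n_parents=3, p_child=0.5)
p_mdm = marginal(lcpt_mdm, parent_probs=[0.6, 0.5, 0.4])
print(len(lcpt_mdm), round(p_mdm, 4))
```

With three parents the table has eight rows, one per combination of parent states, which is what a tool such as the AIspace applet expects as input for a node.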
We assume that network components have these vulnerabilities shown in Table 4.1,
which are derived from [80]. We have made an informed guess about the probability instead
of assuming random numbers for the successful attacks. From CVSS base score computed
online, we have the probability of compromise for each vulnerability (Table 4.2) and finally,
for each network component. The probabilities of the compromise of a particular system will
change over time. Therefore, posterior probabilities of system components are useful to
evaluate such risk in a dynamic environment using Bayes rule. In this way, engineers evaluate
how the effect of an attack propagates to other parts of the system. We feed the compromise probabilities
into the Bayesian Network and compute the probability of compromise of ECC.
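For the last part of the attack path (MDM to OMS to ECC), the propagation reduces to a chain of multiplications, which the following sketch illustrates. It assumes that each node can only be compromised through its single parent, takes P(MDM) = 0.4786 and P(OMS | MDM) = 0.31 as reported in this chapter, and assumes P(ECC | OMS) = 0.78, the Table 4.2 value for the buffer overflow listed for the ECC server. The helper name `propagate_chain` is hypothetical.

```python
from typing import List, Sequence


def propagate_chain(p_root: float, conditional_probs: Sequence[float]) -> List[float]:
    """Unconditional probabilities along a chain of nodes.

    Each node is assumed to be reachable only through its single parent, so
    P(node) = P(node | parent) * P(parent).
    """
    result = []
    p = p_root
    for p_cond in conditional_probs:
        p = p * p_cond
        result.append(p)
    return result


P_MDM = 0.4786            # unconditional probability reported for the MDM
P_OMS_GIVEN_MDM = 0.31    # POODLE probability from Table 4.2
P_ECC_GIVEN_OMS = 0.78    # buffer overflow probability (assumed for this link)

p_oms, p_ecc = propagate_chain(P_MDM, [P_OMS_GIVEN_MDM, P_ECC_GIVEN_OMS])
print(round(p_oms, 4), round(p_ecc, 4))
```

Under these assumptions the chain is consistent with the OMS and ECC probabilities reported for the motivated-attacker case.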
We consider different levels of attackers in terms of their ability to perform sophisticated
attacks [92]. Script kiddies are the first level of attackers, who attack randomly; their chance of
detection is high. The second level is a motivated attacker, someone who understands
the system and then attacks using predeveloped attacks. The third level consists of organized criminals (terrorist
or state-sponsored) who are professional hackers. We assume the probability that an
attacker of a specific level may initiate an attack is as indicated by Table 4.3.
Table 4.1. Network Component Vulnerabilities.
System Name                                           Vulnerabilities
Smart Meters                                          Default Password, Remote Code Execution
Smart Meter Collector Server                          Remote Code Execution
Smart Grid Head End (Windows Server 2008)             Open SSL Heart Bleed
Billing Engine Server (Windows Server 2008)           SQL Injection
Meter Data Management Server (Windows Server 2012)    Open SSL v3, Open SSL HeartBleed
Outage Management Server (Windows Server 2012)        Open SSL POODLE
Energy Control Center Server (Windows Server 2012)    Buffer Overflow
Vendor Specific Server (Windows Server 2012)          Cross Site Request Forgery, Cross Site Scripting (XSS), SQL Injection
Table 4.2. Probability of Compromise using CVSS online Base Score.
Vulnerabilities Probability of Compromise
Remote code execution 0.84
Buffer Overflow 0.78
Denial of Service 0.74
SQL Injection MS SQL Server 0.72
Open SSL Heart Bleed 0.75
Open SSL POODLE 0.31
Cross Site Scripting 0.61
Cross Site Request Forgery 0.88
Table 4.3. Probability of compromising ECC for different levels of hackers.
Levels of Hackers       Probability of Attack   Worst Case   Probability of Compromising ECC
Script Kiddies          0.0 − 0.3               0.3          0.0469
Motivated               0.3 − 0.7               0.7          0.1157
Organization Hackers    0.7 − 1.0               1.0          0.1653
Figure 4.9: Function (top), Network (Middle) and Vulnerability (bottom) Bayesian Network.
Figure 4.10: AIspace Bayes.jar tool Function Nodes.
4.3.5 Tool Prototype and Simulation Results
We have developed the User Interface (UI) of the tool in Java using the regular
window toolkit classes. It represents the framework that is visible to system engineers on the
dashboard. We maintain a set of static files (as a database) of network components, functions,
and vulnerabilities as input to the tool. The tool parses the files and generates the FBN (see
Figure 4.9). Note that our aim is to demonstrate a UI mock-up of the tool that shows its
functionality.
FBN represents the acyclic graph of the connected components according to the test
network. When a user clicks on a node of the FBN, the function’s network components and
their connections are displayed (see Figure 4.9, middle). When a user clicks on a
particular network component, the list of vulnerabilities associated with that component is
generated (see Figure 4.9, bottom). It also contains the details of the vulnerabilities, CVSS
score, and the probability of compromise. When system engineers change the system
configuration in the database, the same changes are reflected in the tool. They can add or
remove any component, discover or patch any vulnerabilities, and disconnect any component.
We use the AIspace simulator [96] to represent the unconditional probabilities of the
functions. We create nodes corresponding to each function and link them according to the test
network (see Figure 4.10). Then, we provide the probability of compromise to each component.
It gives unconditional probabilities of compromise of each function as output. One can view
the expectations by selecting any function on the dashboard. Figure 4.11 represents the
unconditional probabilities of all functions of the test network. Each function is a Bernoulli
variable. The True (T) state represents the probability of compromise, and the False (F) state represents
the probability of not being compromised. A system engineer can easily monitor the status of
the components regarding compromise probabilities by analyzing this graph. If any
vulnerability is discovered in a component or if any vulnerability is patched, the unconditional
probabilities will change automatically. Figure 4.12 represents the case when engineers have
patched the billing engine server, and its probability of compromise becomes zero. There is a
drastic change in the likelihood of compromise of other components which are children of the
billing engine server. Similarly, Figure 4.13 represents the case when a zero-day vulnerability is
discovered in the billing engine server and shows how the probabilities of compromise of its
children change. This is how BAGS enables engineers to evaluate the risk associated with
every component and how risk propagates from one component to another in such an
interdependent network. Engineers set a threshold, based on their experience, on the
unconditional probability of any function and create an alert mechanism.
Figure 4.11: Probability distributions when probability of remote attacker to attack is 0.70.
Figure 4.12: Unconditional Probability Distributions when Billing Engine’s SQL Injection
vulnerability is patched. The effect of such change is propagated to other components.
Figure 4.13: Unconditional Probability Distributions when a Remote Code Execution
vulnerability is discovered in the Billing Engine server. The effect of such change is propagated
to other components.
The advantage of using the Bayesian Network is that once the vulnerabilities change,
the likelihood of ECC compromise also changes, and thus so does the risk. Once attackers control the ECC,
they can manipulate the circuit breakers’ functionality. The graph further educates system engineers
about the various paths attackers might take so that they can patch those system components. Using
this Bayesian Network and the likelihood of compromise of each vulnerability, we compute
the likelihood of compromise of each cyber function of the grid. Next, we assess the impact of
compromising the cyber domain on the physical power grid. Note that in this section we have used the
broad term ECC, which controls many functions of the power grid. Attackers are only interested
in controlling circuit breakers. While scanning the ECC, attackers figure out
which sub-system controls the circuit breakers, and they try to compromise that system. This
methodology accounts for known vulnerabilities, not for zero-days.
Posterior Probabilities. The probability of compromising a function changes over
time depending on the vulnerabilities of its network components and other factors. The
posterior probabilities of system components are useful to evaluate the risk in a dynamic
environment. For example, if we know that OMS is compromised, we can calculate the
likelihood of MDM compromise using Bayes rule:
P(MDM | OMS) = P(OMS | MDM) P(MDM) / P(OMS) = 0.99    (7)
We already know the values P(MDM) = 0.4786 and P(OMS) = 0.1484 (see Figure 4.11),
and P(OMS | MDM) = 0.31. The unconditional probability of MDM getting compromised was
0.4786. However, once we know of an attack incident at OMS, the posterior probability
becomes 0.99. Similarly, power engineers can calculate probabilities of successors of MDM
and other nodes in response to an attack incident on OMS. Such a technique allows engineers
to see how the effect of an attack propagates to other parts of the system. For instance, when
an individual system is outdated and has more vulnerabilities exposed to the outside world,
attackers have a larger attack surface through which to compromise it, and that will affect the probability of
compromise of other nodes as well. Similarly, engineers can evaluate how the attack surface
is reduced when a particular security control is placed on the node or vulnerabilities are reduced
by updating the software (patch management).
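The posterior update for the MDM given an incident at the OMS can be reproduced with a direct application of Bayes rule. The sketch below uses the three values quoted in the text; the function name `posterior` is hypothetical, and no rounding convention is assumed beyond what the text reports.

```python
def posterior(p_evidence_given_h: float, p_h: float, p_evidence: float) -> float:
    """Bayes rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    if p_evidence <= 0.0:
        raise ValueError("P(E) must be positive to condition on the evidence")
    return p_evidence_given_h * p_h / p_evidence


P_OMS_GIVEN_MDM = 0.31  # probability of exploiting the OMS from the MDM
P_MDM = 0.4786          # prior (unconditional) probability of MDM compromise
P_OMS = 0.1484          # unconditional probability of OMS compromise

p_mdm_given_oms = posterior(P_OMS_GIVEN_MDM, P_MDM, P_OMS)
print(f"P(MDM | OMS) = {p_mdm_given_oms:.4f}")
```

The result is close to 1, which matches the structure of the test network: the OMS is reached through the MDM, so observing an OMS compromise makes an MDM compromise almost certain.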
4.4 Quantify Impact of Manipulating Circuit Breakers
In this section, we perform transient stability analysis to quantify the impact of
manipulating circuit breakers.
Attack Modeling. An adversary scans the business network endpoints accessible over
the internet to find their vulnerabilities. She also performs a social engineering attack via
spear-phishing emails to obtain the credentials of the employees. Once she knows the credentials, she
impersonates the power system employees and enters the business/cyber network. She
compromises systems of the cyber network of the grid to find which system is responsible for
controlling circuit breakers, that is, the ECC. Finally, she manipulates any circuit breaker for any
duration at any time.
4.4.1 Physical System Modeling
We use the IEEE 39-bus New England power system as a test system, see Figure 4.14.
The system model and its parameters have been published in [97] [98]. The IEEE 39-bus
system has been used extensively in power system stability studies. The studies include
transient stability [99], voltage stability [100], and rotor angle stability [101][117]. In this
section, we perform transient stability analysis on the IEEE 39-bus system without any
modifications. In Chapter 5, we modify the system by replacing the generator at bus 39 with a
Photo Voltaic (PV) system and evaluate whether it can improve the resilience of the grid against
CPAs.
Figure 4.14: The IEEE-39 Bus Model.
4.4.2 Quantify Impact and Simulation Analysis
Transient stability refers to the ability of the system to retain synchronism after
a significant disturbance [103][102]. To study the impact of an attack on the power grid, we
monitor the rotor angles of the generators and the frequency of the system. The disturbance
could lead to a loss of synchronism of one generator (local) or loss of synchronism in the whole
power system (global). The power system could remain stable in the local case if there is a
power balance between the supply and demand, but it becomes unstable in the global case.
That means the rotor angle stability could be either local or global. However, the frequency
stability issues are global and affect the whole power system. So, the frequency of the system
is used to measure the impact of manipulating circuit breakers. Based on the changes in the
frequency value, the impact is quantified. Our analysis took into consideration two main factors
that could influence the attack:
1. The generator under attack taking into consideration the amount of power produced by
this generator. Two generators were analyzed: first, generator on bus 30 as it produces
the least amount of power (250MW). Second, generator on bus 33 as it represents an
average size generator (632MW) compared to other generators in the system.
2. The period (T ) during which the breaker will remain open. We assume that the attacker
opens the breaker connecting the generator to the system and waits for time T then
closes the breaker.
As noted above, the two factors considered are the size of
the generator under attack (MW) and the period (T) during which a breaker remains open. We
make the simplifying assumption that attacks are considered only at small/medium size
generators. If the system becomes unstable for an attack at a small/medium generator, it is clear
that the impact worsens if the attack occurs at a larger generator. It would be interesting to
simulate all the possible scenarios of cyber-attacks at different generators in the system model.
That could help us find the locations that are more or less vulnerable to damaging attacks.
However, we leave this analysis to future work.
Standard frequency deviations are approximately ±0.03Hz from the scheduled value. If
frequency deviates by more than ±1.0Hz, then damage to customer and utility equipment could
occur [104]. Table 4.4 lists the four levels of impact on the smart grid: no impact,
low, medium, and high, where high represents the worst impact on the smart grid with an
assigned value of 10. Figures 4.15 (frequency) and 4.16 (rotor angle) demonstrate the results
of attacking the breaker connecting the generator on bus 30 (250MW), which produces the
least amount of power in the system. The breaker was opened after one second of the simulation
and closed after the period T, which is varied from 100ms to 500ms in 100ms steps. The results
of the five scenarios are shown in Figures (4.15a – 4.15e) and (4.16a – 4.16e). Figure 4.16
(rotor angle results) confirms the results presented in Figure 4.15. The results in Figures 4.16a
– 4.16d demonstrate increased oscillations. Nevertheless, the system remains stable and in
synchronism. However, in Figure 4.16e, the increased oscillations become severe, which leads
to a loss of synchronism.
Similarly, we analyze the impact of attacking a generator on bus 33 that produces an
average amount of power (632MW), see Figures 4.17 and 4.18. The breaker was opened after
one second of the simulation and closed after the period T, which is varied from 100ms to
500ms in 100ms steps. The results of the five scenarios are shown in Figures (4.17a - 4.17e)
and (4.18a - 4.18e). Figure 4.18 (rotor angle results) confirms the results presented in Figure
4.17. The results in Figures 4.18a - 4.18b demonstrate increased oscillations. Nevertheless, the
system remains stable and in synchronism. However, in Figures (4.18c - 4.18e) the increased
oscillations become severe, which leads to a loss of synchronism.
Figure 4.15: The frequency of the system on four buses when the attack happens. Attack
performed by opening the breaker of generator on bus 30 and closing it after time T. Each
figure demonstrates different duration during which the breaker remained open. The breaker
is opened after 1 second of the simulation. NO PV is used in the simulation.
Figure 4.16: The rotor angle of the system on four buses when the attack happens. Attack
performed by opening the breaker of generator on bus 30 and closing it after time T. Each
figure demonstrates different duration during which the breaker remained open. The breaker
is opened after 1 second of the simulation. NO PV is used in the simulation.
Figure 4.17: The frequency of the system on four buses when the attack happens. Attack
performed by opening the breaker of generator on bus 33 and closing it after time T. Each
figure demonstrates different duration during which the breaker remained open. The breaker
is opened after 1 second of the simulation. NO PV is used in this simulation.
Figure 4.18: The rotor angle of the system on four buses when the attack happens. Attack
performed by opening the breaker of generator on bus 33 and closing it after time T. Each
figure demonstrates different duration during which the breaker remained open. The breaker
is opened after 1 second of the simulation. NO PV is used in this simulation.
Table 4.4. Frequency deviation and impact are deduced from [104]. The third
column represents our estimation of the quantitative impact.
Frequency Deviation (Hz)                  Impact                                      Quantitative Impact Value
(59.97 - 60.03)                           Normal operation                            No impact (0)
(59.50 - 59.97) or (60.03 - 60.50)        Continuous operation but undesirable        Low (3)
(59.00 - 59.50) or (60.50 - 61.00)        Restricted operation                        Medium (6)
Frequency > 61.00 or frequency < 59.00    Damage to customer and utility equipment    High (10)
4.5 Risk Determination
Risk is defined as the impact of an event (CPA) times its likelihood [25]. To compute
the risk value, we multiply the impact by the likelihood according to the following equation:
Risk = Impact × Likelihood
We analyze the results demonstrated in Figures 4.15 and 4.17 and map them to Table
4.4. According to Figure 4.15a (attack on bus 30 where T = 100ms), the frequency is within
the continuous operation region (i.e., frequency exceeds 60.00±0.03 Hz but within 60.00±0.50
Hz). The impact of this attack is low and assigned a value of 3 (Table 4.4). This value is
multiplied by the likelihood. The probability of successfully compromising ECC, as
demonstrated in Figure 4.11 and assuming a “motivated attacker” is 0.1157. So the risk value,
in this case, is 0.35. The risk values for attacking the circuit breaker on bus 30 varying the
period T (100ms, 200ms, 300ms, 400ms, and 500ms) and assuming a “motivated attacker” are
0.35, 0.35, 0.69, 1.16 and 1.16, respectively. Table 4.5 shows the risk value for attack on Bus
30.
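The mapping from frequency to impact score (Table 4.4) and the risk computation can be sketched as follows. The band edges follow Table 4.4, but how values exactly on a boundary are classified is an assumption of this sketch, as are the function names. The impact scores for bus 30 are those listed in Table 4.5.

```python
NOMINAL_HZ = 60.0  # scheduled frequency of the test system


def impact_score(frequency_hz: float) -> int:
    """Map a system frequency to the quantitative impact value of Table 4.4.

    Boundary handling (inclusive upper edges) is an assumption of this sketch.
    """
    deviation = abs(frequency_hz - NOMINAL_HZ)
    if deviation <= 0.03:
        return 0   # normal operation
    if deviation <= 0.50:
        return 3   # continuous but undesirable operation
    if deviation <= 1.00:
        return 6   # restricted operation
    return 10      # damage to customer and utility equipment


def risk(impact: int, likelihood: float) -> float:
    """Risk = Impact x Likelihood."""
    return impact * likelihood


P_ECC_MOTIVATED = 0.1157  # probability of compromising ECC, motivated attacker

# Impact scores for T = 100, 200, 300, 400, 500 ms on bus 30 (Table 4.5).
bus30_impacts = [3, 3, 6, 10, 10]
bus30_risks = [round(risk(i, P_ECC_MOTIVATED), 2) for i in bus30_impacts]
print(bus30_risks)
```

Keeping the impact mapping and the likelihood separate makes it straightforward to recompute the risk column whenever the Bayesian network yields a new P(ECC), for example after a patch.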
The risk values of attacking the circuit breaker on bus 33 while varying the period T are 0.35
for T = 100ms and 1.16 for the remaining cases. Table 4.6 shows the risk values for an attack on
Bus 33. These results refer to scenario 1, in which no system is patched.
To provide a relative comparison, suppose T = 100ms is the base case, where
the risk score for attacking bus 30 and bus 33 is 0.35 in scenario 1. As the duration of the
attack increases, the risk value also increases. The risk value for attacking bus 30 is 1.97 times
the base value for T = 300ms and 3.31 times for T = 500ms. This shows that it is necessary
to apply some security mechanism that will prevent the situation from getting worse. If the attack
happens on bus 33, the risk value is 3.31 times the value of bus 30 for T = 200ms. This indicates
that power engineers should choose which bus to protect so that the impact can be minimized.
Table 4.5. Risk Values for Attacking CB connecting a generator at Bus 30. Risk = P(ECC) * Impact.
Time Period (ms)   Impact Score   Risk (Scenario 1, P(ECC) = 0.1157)   Risk (Scenario 2, P(ECC) = 0.0836)
100                3 (Low)        0.35                                 0.25
200                3 (Low)        0.35                                 0.25
300                6 (Medium)     0.69                                 0.5
400                10 (High)      1.16                                 0.84
500                10 (High)      1.16                                 0.84
Table 4.6. Risk Values for Attacking CB connecting a generator at Bus 33. Risk = P(ECC) * Impact.
Time Period (ms)   Impact Score   Risk (Scenario 1, P(ECC) = 0.1157)   Risk (Scenario 2, P(ECC) = 0.0836)
100                3 (Low)        0.35                                 0.25
200                3 (Low)        1.16                                 0.84
300                6 (Medium)     1.16                                 0.84
400                10 (High)      1.16                                 0.84
500                10 (High)      1.16                                 0.84
In scenario 2, when the billing engine is patched, the probability of a successful
attack on ECC drops to 0.0836 for a “motivated attacker”. Consequently, the risk values for
attacking the circuit breaker on bus 30 for T values 100ms, 200ms, 300ms, 400ms and 500ms
drop to 0.25, 0.25, 0.50, 0.84 and 0.84. Similarly, the risk of attacking the circuit breaker
on bus 33 drops to 0.25 for T = 100ms and 0.84 for the remaining cases.
It is essential to note that: first, overall vulnerability status is based on known and
unknown (zero-day) vulnerabilities. In this dissertation, we considered only known
vulnerabilities. If all known vulnerabilities are patched, then their probability of successful
attack becomes zero. Second, the probability of a successful attack directly depends on the path
of the system that is being patched. For example, patching MDM or OMS will reduce the
probability of a successful attack to zero, as they are on the critical attack path to ECC. Patching the
smart grid head server (instead of the billing engine) will have the same effect as patching
the billing engine, as they are on the same path and there are no alternative paths. However,
patching the vendor-specific server will reduce the probability of a successful attack to 0.0498
instead of 0.0836. In this case, there are alternative paths to attacking ECC. This means that
patching certain systems should be given higher priority as they have a higher impact on the
probability of successful attack.
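The prioritization argument can be illustrated by ranking candidate patches by how much each one lowers P(ECC). The sketch below only reuses the probabilities reported in this section for a motivated attacker; the ranking function and its name are hypothetical and are not part of the tool described in the dissertation.

```python
from typing import Dict, List, Tuple

BASELINE_P_ECC = 0.1157  # no system patched (scenario 1)

# P(ECC) after patching a single system, as reported in the text.
p_ecc_after_patch: Dict[str, float] = {
    "Billing engine": 0.0836,
    "Vendor-specific server": 0.0498,
    "MDM": 0.0,
    "OMS": 0.0,
}


def rank_patches(baseline: float, after_patch: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order candidate patches by the reduction in P(ECC), largest first.

    Ties are broken alphabetically so that the ordering is deterministic.
    """
    reductions = {name: baseline - p for name, p in after_patch.items()}
    return sorted(reductions.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_patches(BASELINE_P_ECC, p_ecc_after_patch)
for name, reduction in ranking:
    print(f"{name}: P(ECC) reduced by {reduction:.4f}")
```

A ranking of this kind only considers likelihood; in practice it would be combined with the patching cost and with the impact scores of Section 4.4 before deciding where to deploy limited resources.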
4.6 Usefulness to Power Engineers
How will the risk score help engineers in managing the risk of the grid? The first step of
risk management is to evaluate risk, which we have performed in the previous sections. The
next step is to reduce risk. In order to manage risk, engineers reduce the likelihood and the
impact of an attack. Based on the computed risk, engineers evaluate which factor contributes more. If the probability of
compromise contributes more to the risk, they should patch systems that have a high impact
on the physical domain. If the impact is high (as in the case of ECC compared to other
systems), they should deploy resources to reduce the impact. In the next chapter, we show how to
reduce risk by reducing the likelihood and impact of an attack by taking actions in the cyber
and physical domain.
Chapter 5
Risk Mitigation
In this chapter, we present three ways to mitigate risk, either by reducing the likelihood
of an attack or by reducing its impact on the system. First, we reduce the likelihood of an
attack in the cyber domain by deciding which system to patch or scan. Second, we evaluate
whether a photovoltaic (PV) system in the grid reduces risk. Finally, we show how to reduce
risk by implementing a governor system (policy server) that prevents cyber attacks from
propagating to the physical domain.
5.1 Reduce the Likelihood of Attack in the Cyber Domain
The security of critical infrastructure such as the smart grid is of significant concern
because CPAs are becoming a frequent occurrence. Cybercriminals compromise various
functions of a cyberinfrastructure to control physical processes maliciously. It is the system
administrator’s goal to find vulnerabilities in the cyber functions and patch them before they
are compromised. Unfortunately, limited resources and a large attack surface make it difficult
to decide which function to protect in a particular system state. In this section, we tackle the
problem of resource allocation in the smart grid system by proposing a tool, Reinforcement
Learning-Bayesian Attack Graph for Smart Grid System (RL-BAGS), which lets power engineers
compute, at regular intervals, optimal policies about whether to SCAN or PATCH a particular
cyber function of the grid. By performing effective patch management in the cyber domain, we
prevent the system from being compromised based on the known risk, i.e., the avoidance phase
of resilience.
5.1.1 Motivation
While many organizations face the challenge of cyber-physical threats, we highlight
some of the challenges faced by companies providing security solutions to critical
infrastructures. Most companies provide solutions to implement a Security Operations Center
to manage risk and provide better security for the smart grid infrastructure. The problems faced
by engineers are: 1) how to analyze the system status and information flow in real time, 2)
how to discover and analyze the vulnerabilities associated with the system functions, and 3)
how to decide when to patch the discovered vulnerabilities.
To address these challenges, we proposed the BAGS tool (in sub-section 4.3.3) to
quantify the likelihood of an attack based on vulnerabilities. BAGS takes functions,
network architecture, applications, and a vulnerability report as input and generates three
Bayesian Networks (BN): Functional Bayesian Network, Network Bayesian Network, and
Vulnerability Bayesian Network. BAGS enables the system administrator (SA) to analyze how a
failure of a network component controlling a particular power grid functionality propagates
from the cyber to the physical domain and to quantify the probability of its compromise using
CVSS scores.
In BAGS, we assume SA has already performed a vulnerability assessment of the different
system functions, and we use a vulnerability report to compute the probability of compromise
of a particular component. The problem is that it is not possible to perform the vulnerability
assessment of all the functions simultaneously. Assessment is costly because it affects the
normal working of the functions, and in such critical systems it is not advisable to place
functions on standby for testing. Furthermore, patching vulnerable components takes time, and
for that duration the components might not be functional. When SA has multiple vulnerable
components, it is hard to choose which component to patch first. Two actions correspond to
each component: 1) SCAN-X to find out whether component X is vulnerable, patched or
hacked, and 2) PATCH-X to patch the vulnerabilities of system X. To perform either of these
actions, power engineers must place the function in standby mode, and it is not possible to
stop several functions simultaneously. Furthermore, they know the internal
network but not the position where an attack might have occurred until the symptoms of the
attack are observed. Because the time budget is limited and the basic state of the
functions is unknown, engineers cannot patch vulnerabilities efficiently. To patch
systems, engineers must know whether the systems are vulnerable. To discover the
vulnerabilities, they must perform an assessment and then patch management. Our approach
assists them in taking these actions optimally.
5.1.2 System Description
In this sub-section, we describe the tool design for RL-BAGS. We demonstrate how to
model the grid as a Markov Decision Process (MDP). Finally, we describe what
actions an SA (learning agent) can perform, the different states of the system, and how rewards
are calculated for each cyber function.
RL-BAGS. In RL-BAGS, we provide the functionality to compute an optimal policy for
the grid, as represented by BAGS, using RL algorithms. The optimal policy represents the set of
actions that SA should take when the system is in a particular state to maximize the discounted
future rewards. The remaining functionality of the tool is the same as BAGS.
Agent. The system has only one agent who performs actions on the system: the defender
(SA). Although there could be multiple defenders trying to protect the system, we consider
only one agent that abstracts all other defenders. The main goal of the defender is to patch all
the functions of the system. The defender initially has no knowledge of the state of the system.
She must scan the nodes to learn their status and then perform the patching. Due to cost
constraints, she is allowed to perform one action on a single node at a time, because it is
impossible to keep several functions in standby mode at once. We have not considered an
adversary in this study. Instead, we specify a function state as HACKED to understand the
behavior of the defender when functions are compromised.
Actions. The agent has two actions at her disposal: SCAN-X and PATCH-X, where X denotes
the symbol of the function considered by the SA. In total, we have 7 functions in the graph.
Therefore, SA has 14 actions. A denotes the action set:
{SCAN-S1, PATCH-S1, SCAN-S2, PATCH-S2, SCAN-S3, PATCH-S3, SCAN-S4,
PATCH-S4, SCAN-S5, PATCH-S5, SCAN-S6, PATCH-S6, SCAN-S7, PATCH-S7}.
System. The Functional Bayesian Network (FBN) built in BAGS (see Figure 4.6)
represents the state of the system. Each node represents a function of the SGS such as Smart
Meters (S1), Outage Management (S4), etc. The edges represent the flow of information from
one function to another. Each function in the FBN consists of network components, and each
network component has some vulnerabilities associated with it. Initially, we assume SA has no
knowledge of the status of any function, and she must learn it by performing actions. A function
can be in any one of the following states:
{UNKNOWN, VULNERABLE, PATCHED, HACKED}
The initial state (si) of the system is the one in which the status of every function is
UNKNOWN. The terminal states (st) of the system are as follows:
TERMINAL STATE 1: the status of all functions is PATCHED.
TERMINAL STATE 2: function S7 is HACKED.
In the first terminal state, SA receives a reward of 500 for having patched all the
systems. If some systems are HACKED, she cannot control their functionality and therefore
she should remove those system instances from the network for scanning. Once a system is
hacked, its status does not change until SA patches it.
In the second terminal state, the attacker has compromised S7 and can control the
power dispatch, so SA has failed to protect the system. SA receives a reward of -500.
Since we have y = 7 functions in the graph, and each function can be in any of x = 4 states,
the state space consists of x^y = 4^7 = 16384 states. An example of a state:
s: {S1: VULNERABLE, S2: VULNERABLE, S3: VULNERABLE, S4: VULNERABLE,
S5: VULNERABLE, S6: VULNERABLE, S7: VULNERABLE}.
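The action set, state space and terminal states described above can be enumerated directly. The following is a minimal sketch, not the BURLAP-based implementation used in the experiments; the names `FUNCTIONS`, `STATUSES`, `ACTIONS`, `STATES` and `is_terminal` are hypothetical and chosen for illustration only.

```python
from itertools import product

FUNCTIONS = [f"S{i}" for i in range(1, 8)]  # S1..S7
STATUSES = ["UNKNOWN", "VULNERABLE", "PATCHED", "HACKED"]

# 14 actions: SCAN-X and PATCH-X for each function X.
ACTIONS = [f"{verb}-{fn}" for fn in FUNCTIONS for verb in ("SCAN", "PATCH")]

# Every assignment of one status to each function is one system state.
STATES = [dict(zip(FUNCTIONS, combo))
          for combo in product(STATUSES, repeat=len(FUNCTIONS))]

INITIAL_STATE = {fn: "UNKNOWN" for fn in FUNCTIONS}


def is_terminal(state):
    """TERMINAL STATE 1: every function PATCHED; TERMINAL STATE 2: S7 HACKED."""
    all_patched = all(status == "PATCHED" for status in state.values())
    return all_patched or state["S7"] == "HACKED"


print(len(ACTIONS), len(STATES))
```

Enumerating the full state list is feasible here only because the network components are grouped into seven functions.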
Function Significance. The significance of each function of the grid is precomputed
in the system and it changes with the change in one of the following factors:
1. Asset value (AV): The value of the asset to the SGS.
2. Rate of Occurrence (RO): The frequency at which attacks happen on a function.
3. Risk Exposure (RE): The amount of loss to the grid if a function is compromised.
4. Probability of Compromise (PoC): The average exploitability score that is
calculated using CVSS score based on the vulnerabilities.
5. Influence of Function (IOF): The influence of a function on the grid.
6. Cost to Patch (CTP): The cost of patching the cyber function’s network components.
The reward (importance) of a function f is computed as:
Importance_f = AV × RO × RE × PoC × IOF − CTP (1)
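Equation (1) translates into a one-line function. The sketch below is illustrative only: the function name `function_importance` and the numeric inputs are hypothetical, since the dissertation assigns these values logically rather than from measured data.

```python
def function_importance(asset_value, rate_of_occurrence, risk_exposure,
                        prob_of_compromise, influence, cost_to_patch):
    """Eq. (1): product of the five significance factors minus the patch cost."""
    return (asset_value * rate_of_occurrence * risk_exposure
            * prob_of_compromise * influence) - cost_to_patch


# Hypothetical values, chosen only to show the arithmetic.
example = function_importance(asset_value=10, rate_of_occurrence=2,
                              risk_exposure=5, prob_of_compromise=0.5,
                              influence=1, cost_to_patch=20)
print(example)
```

Because the factors are multiplied, a function with a zero probability of compromise contributes only its (negative) patch cost.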
The probability of compromise changes frequently because the vulnerabilities are
dynamic. Initially, this value is not known because SA does not know whether functions are
vulnerable. SA scans a function to discover its status. In the simulation, SA acts
randomly on a state, and rewards are observed. SA must know the status of a function before
taking a PATCH action. The function importance should be calculated either from data using
statistical measures or based on SA’s experience.
State Transition and Rewards. Table 5.1 shows how the status of a function
changes and what reward SA receives for performing an action. If SA performs a SCAN
action on a function whose previous status is HACKED, the status remains HACKED and SA
receives a reward of -200, since it is a bad move. Similarly, if the previous status is
VULNERABLE and the next status is also VULNERABLE, it is a bad move, and SA receives
a reward of -200. If the previous status is UNKNOWN, we decide at random
whether the function is PATCHED, VULNERABLE, or HACKED. If it is PATCHED, SA receives
a reward of 0 because it was not worth scanning, and if it turns out to be VULNERABLE or
HACKED, SA receives the function importance as the reward. Note that a node in the
PATCHED state can later be found in the VULNERABLE or HACKED state.
5.1.3 Reinforcement Learning (RL)
Reinforcement learning is an area of machine learning that studies how agents should
act in an environment to maximize their cumulative rewards. RL problems are modeled as
MDPs [113]. The agent interacts with the environment in discrete time steps by performing an
action (see Figure 5.1). At each time step, it receives a reward (an observation) and the
environment moves to a new state, where another action is chosen. The algorithm follows the
same routine until a final state is reached or the algorithm converges.
Table 5.1: Function state transition and reward function
Action   Previous State   Next State         Reward
SCAN     HACKED           HACKED             -200
SCAN     VULNERABLE       VULNERABLE         -200
SCAN     UNKNOWN          VULNERABLE         importance
SCAN     UNKNOWN          HACKED             importance
SCAN     UNKNOWN          PATCHED            0
SCAN     PATCHED          VULNERABLE         importance
SCAN     PATCHED          HACKED             importance
SCAN     PATCHED          PATCHED            -200
PATCH    HACKED           PATCHED            importance
PATCH    VULNERABLE       PATCHED            importance
PATCH    PATCHED          PATCHED            -200
PATCH    UNKNOWN          UNKNOWN            -200
-        ANY              TERMINAL STATE 1   500
-        ANY              TERMINAL STATE 2   -500
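The per-function rows of Table 5.1 can be written as a single transition function. This is a sketch under stated assumptions: the name `step` is hypothetical, the outcome of a scan is drawn uniformly at random (the text says only that it is decided at random), and the terminal rewards of 500 and -500 are left to the caller.

```python
import random

PENALTY = -200


def step(action, previous, importance, rng=random):
    """Return (next_status, reward) for one function, following Table 5.1."""
    if action == "SCAN":
        if previous in ("HACKED", "VULNERABLE"):
            return previous, PENALTY  # status already known: wasted scan
        # UNKNOWN or PATCHED: the scan reveals the current status.
        # A uniform draw is an assumption made for this sketch.
        found = rng.choice(["PATCHED", "VULNERABLE", "HACKED"])
        if found == "PATCHED":
            reward = 0 if previous == "UNKNOWN" else PENALTY
        else:
            reward = importance
        return found, reward
    if action == "PATCH":
        if previous in ("HACKED", "VULNERABLE"):
            return "PATCHED", importance
        return previous, PENALTY  # PATCHED or UNKNOWN: nothing known to patch
    raise ValueError(f"unknown action: {action}")


print(step("PATCH", "VULNERABLE", importance=50))
```

Passing a seeded `random.Random` instance as `rng` makes a simulated episode repeatable.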
Figure 5.1: Q-Learning and SARSA learning iteration. The agent (SA) performs an action (a)
on the system state (s) represented by the FBN. The system moves to a state (s’). The values
s, a, s’ are given to the reward function, which computes the reward r’, and then to the
Q-value update function. Finally, the agent observes s’ and the maximum value of Q(s,a).
Q-Learning. It is a model-free RL algorithm [114]. It is proven to converge to an
optimal policy for a finite number of states and actions for a single agent. Model-free
means that we do not need state transition functions to move from one state to another. In this
algorithm, the learned action-value function, Q, directly approximates the optimal action-value
function, Q*. A Q value is assigned to each state-action pair. It is an off-policy temporal
difference algorithm: it does not depend on the policy followed by the agent. The agent chooses
an action in the environment based on the ε-greedy policy. Q-learning can be sensitive to local
optima, so this policy forces the agent to take a random (non-optimal) action with probability ε
(exploration) and the optimal action with probability (1 − ε) (exploitation). This policy allows
the agent to explore all the possible states, and eventually it tunes the Q-values by choosing
the best actions that maximize the discounted rewards. After performing an action, the system
moves to a new state, and the agent receives a reward. The reward is used to update the
Qt(s, a) value corresponding to each state and action. The total reward Q is computed
according to (2), and the update rule for Q-learning is (3):
$$Q = \sum_{t=1}^{n} \gamma^{t-1}\, r_t(s_t, a_t) \tag{2}$$
$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ r' + \gamma \max_{b} Q_t(s', b) - Q_t(s, a) \right] \tag{3}$$
where α ∈ [0,1] is the learning parameter, and γ ∈ [0,1] is the discount factor that weights
future rewards. The above rule updates the Q value for the last state (s) and action (a) pair
with respect to the observed outcome state (s’) and reward (r’). The optimal action
corresponding to a particular state is determined by:
$$a_t = \arg\max_{a} Q(s_t, a) \tag{4}$$
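The ε-greedy selection and the Q-learning update of (3) can be sketched in a few lines. This is a generic tabular illustration, not the BURLAP code used in the experiments; the names `epsilon_greedy` and `q_learning_update`, the string state labels, and the reward of 100 are hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise take the greedy action (Eq. 4)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply the update rule of Eq. (3) in place."""
    best_next = max(q[(next_state, b)] for b in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


q = defaultdict(float)  # unseen state-action pairs start at 0
actions = ["SCAN-S7", "PATCH-S7"]
q_learning_update(q, "s0", "SCAN-S7", reward=100, next_state="s1",
                  actions=actions, alpha=0.2, gamma=0.2)
print(q[("s0", "SCAN-S7")])
```

The update bootstraps from the best next action regardless of which action the agent actually takes next, which is what makes the method off-policy.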
SARSA(λ) Learning. SARSA stands for State-Action-Reward-State-Action. It is an
on-policy, model-free temporal difference algorithm. In contrast to Q-learning, it does not
bootstrap from the best possible next action that maximizes the rewards; instead, it chooses
the next action using the same policy it is following. In SARSA learning, an action is taken in
a specific state and a reward is received, the system moves to the next state, and another
action is chosen using the same policy. The update rule for SARSA learning is:
$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ r' + \gamma Q_t(s', a') - Q_t(s, a) \right] e_t(s, a) \tag{5}$$
$$e_t(s, a) = \begin{cases} \gamma \lambda\, e_{t-1}(s, a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda\, e_{t-1}(s, a) & \text{otherwise} \end{cases} \tag{6}$$
The idea of using eligibility trace is to apply temporal difference prediction to
state-action pairs instead of states. λ refers to the use of eligibility trace. When λ=0, there
is a one-step backup return as in Q-learning. When λ=1, there is one episode backup return as
in Monte Carlo. In SARSA learning model, λ=0.4. et (s, a) in (5) denotes the trace of
state-action pair.
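The SARSA(λ) update with accumulating eligibility traces, as in (5) and (6), can be sketched as follows. This is a generic tabular version for illustration, not the BURLAP implementation; the function name `sarsa_lambda_update`, the state and action labels, and the rewards are hypothetical.

```python
from collections import defaultdict


def sarsa_lambda_update(q, e, s, a, reward, s_next, a_next, alpha, gamma, lam):
    """One SARSA(lambda) step with accumulating traces, Eqs. (5) and (6)."""
    # TD error uses the action actually chosen next (on-policy).
    delta = reward + gamma * q[(s_next, a_next)] - q[(s, a)]
    # Eq. (6): decay every existing trace, then add 1 for the visited pair.
    for key in list(e):
        e[key] *= gamma * lam
    e[(s, a)] += 1.0
    # Eq. (5): move every traced pair along the TD error, scaled by its trace.
    for key, trace in e.items():
        q[key] += alpha * delta * trace


q, e = defaultdict(float), defaultdict(float)
sarsa_lambda_update(q, e, "s0", "a0", 10, "s1", "a1", alpha=0.2, gamma=0.2, lam=0.4)
sarsa_lambda_update(q, e, "s1", "a1", 5, "s2", "a2", alpha=0.2, gamma=0.2, lam=0.4)
print(q[("s0", "a0")], q[("s1", "a1")])
```

In the second step, the earlier pair (s0, a0) still receives a small share of the new TD error through its decayed trace, which is how credit flows back along the episode.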
5.1.4 Experiment
Assumptions. In our simulation, we made the following assumptions:
1. In the grid, the availability of data is a major problem. Therefore, to compute
the rewards of the functions, we assign values to the variables (AV, RO, PoC)
logically, so that more important functions receive higher values.
2. In real time, PoC values are adjusted when SA scans a cyber function, because
without scanning she would not know the set of vulnerabilities and hence the PoC
value. In the simulation, we calculate PoC values in advance, together with the
function importance, because the data available for computing function importance
is limited.
3. SA must recompute optimal policies at regular intervals because the
probability of compromise of the system changes over time, either because the
vulnerabilities of the system change or because SA has patched network
components. We have not considered the change in PoC values.
4. The smart grid is a complex system that contains many network components.
Including all network components in the simulation would increase the state space
exponentially and make it intractable. Therefore, we grouped the network
components under the function they support. We use the FBN as the state of the
system.
5. In the simulation, we compute the optimal policy for the SA. To simulate an
attacker, we randomly assign the HACKED status to functions during the
simulation, so as to learn the behavior of the SA in the presence of HACKED
states.
6. We have not considered the time it takes to patch or scan a system. We assume it
is the same for all functions.
Experiment Setup. We use the BURLAP [115] library to implement the Q-learning
and SARSA learning algorithms. Our main task was to model the problem as an MDP
without a state transition probability function and to implement both learning algorithms. We
implemented the StateWorld class to represent the model of the system. In StateWorld, the
transition function takes a state and an action as parameters and returns the next state. We
implemented the Reward and Terminal state interfaces that form part of StateWorld.
The Q-learning algorithm runs for 100,000 episodes, and the SARSA learning algorithm runs for
300,000 episodes to explore the state space and learn the optimal policy. In each episode, the
agent learns from the initial state until it reaches one of the terminal states and updates the
Q-values within each episode. At the end of the simulation, the Q-values converge to an optimal
value, and we can compute the optimal policy using Eq. (4).
Agent Learning Procedure. Each episode starts from the state si. The defender
randomly selects a function to SCAN (SCAN-SX) and discovers whether it is VULNERABLE
or HACKED. The system moves to the next state si+1, in which the function SX has the
discovered status. The defender receives a reward for performing that action in state si and
moving to state si+1, according to Table 5.1. The reward received by the defender depends on
the importance of the function on which the action is performed, the action performed, and the
discovered status. If si+1 is a terminal state, the episode ends. We store the action
sequence, state sequence, and rewards of each episode to plot the learning graph.
Figure 5.2: Each node represents a function and its status (V: VULNERABLE, H: HACKED, U:
UNKNOWN, P: PATCHED). The admin performs action SCAN-S2 and discovers that S2 is
HACKED; the system moves to the next state, where S2 is HACKED, and the admin receives the
function importance as the reward.
Figure 5.3: The agent chooses action PATCH-S2. The status of function S2 changes to P and
the system moves to the next state, shown in the right-side figure. The admin receives a
reward equal to the importance of S2.
An example of how the defender behaves and learns. Suppose the initial state of the
system is the one described in Figure 5.2 (left side). Initially, the status of S2 is UNKNOWN.
SA takes action SCAN-S2 and discovers that S2 is in the HACKED state. The system moves to the
state described in Figure 5.2 (right side), and SA receives the importance of S2 as the reward.
SA updates the Q values of the state-action pairs using the received reward according to
Eq. (3) in the Q-learning algorithm and Eq. (5) in the SARSA learning algorithm. Now, SA takes
the PATCH-S2 action, and the system moves to the state in Figure 5.3 (right side), where S2 is
patched. SA receives a much higher reward for patching S2 (depending on the importance of the
node) and updates its Q values. The simulation continues in the same way until a final state of
the system is reached. The episode ends when either TERMINAL STATE 1 or TERMINAL STATE 2 has
been reached, and SA receives the corresponding reward (Table 5.1).
5.1.5 Simulation Analysis
The simulation results show the average reward per episode against the number of
episodes for the Q-learning (see Figures 5.4, 5.5, 5.6) and SARSA (see Figure 5.7) learning
algorithms. We calculate the moving average reward per 300 episodes over: 1) 100,000 episodes
for Q-learning with ε values of 0.2 (Figure 5.4), 0.6 (Figure 5.5) and 0.8 (Figure 5.6), and
2) 300,000 episodes for SARSA learning. If we set ε to any other constant in [0,1], the
algorithm will still converge to the optimal policy. The exploration (with probability ε)
diminishes over time, and the policy becomes greedy and thus optimal. The only difference is
that, with a large ε, the agent takes random actions with high probability even after the
optimal policy is learned. In Figure 5.4, the agent takes the optimal action with probability
0.8 (and a random action with probability 0.2); therefore we see stable average rewards after
some episodes are completed. In contrast, in Figure 5.6 the agent still takes random actions
most of the time even after an optimal policy has been computed.
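The learning curves are smoothed with a 300-episode moving average. The sketch below shows one way to compute such a curve; it assumes a sliding window over per-episode rewards (the text does not say whether the windows overlap), and the function name `moving_average` is hypothetical.

```python
def moving_average(rewards, window=300):
    """Average of each run of `window` consecutive per-episode rewards."""
    if window <= 0 or len(rewards) < window:
        return []
    total = sum(rewards[:window])
    averages = [total / window]
    for i in range(window, len(rewards)):
        # Slide the window: add the newest reward, drop the oldest.
        total += rewards[i] - rewards[i - window]
        averages.append(total / window)
    return averages


print(moving_average([1, 2, 3, 4, 5], window=2))
```

Maintaining a running total keeps the cost linear in the number of episodes, which matters for runs of 100,000 to 300,000 episodes.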
Figure 5.4: Q-Learning: Plot of the moving average of the 300 average rewards per episode
for 100,000 trials with ε-greedy policy for exploration and exploitation ε=0.2, learning rate
α=0.2 and discount factor γ=0.2.
Figure 5.5: Q-Learning: Plot of the moving average of the 300 average rewards per episode
for 100,000 trials with ε-greedy policy for exploration and exploitation ε=0.6, learning rate
α=0.2 and discount factor γ=0.2.
Figure 5.6: Q-Learning: Plot of the moving average of the 300 average rewards per episode
for 100,000 trials with ε-greedy policy for exploration and exploitation ε=0.8, learning rate
α=0.2 and discount factor γ=0.2.
Figure 5.7: SARSA-Learning: Plot of the moving average of the 300 average rewards per
episode for 300,000 trials, learning rate α=0.2, lambda λ=0.4 and discount factor γ=0.2.
The difference between the Q-learning (off-policy) and SARSA learning (on-policy, see
Figure 5.7) approaches is easy to see. The average reward per episode in SARSA
learning is higher than in Q-learning. This shows that SARSA learning is useful when we want to
optimize the reward for an SA that is exploring the state space of the system (rather than
exploiting the best possible action). SARSA follows the current exploration policy, which may
or may not be greedy. It is possible that SARSA will find different policies than Q-learning
(in the case where following the exploration strategy leads to huge penalties). For example,
suppose the status of a function in our model is VULNERABLE and the optimal action is not to
patch this function for now. Because the function is in the VULNERABLE state and an attacker
can hack it, SARSA learning will discover this and PATCH that function.
On the other hand, Q-learning will PATCH that function only if it is the optimal action in
the current state (the one that returns the maximum reward for the state-action pair). This is
the reason why SARSA learning converges and reaches a terminal state quickly and may have a
different optimal policy compared to Q-learning. Q-learning may take a long time to find an
optimal policy compared to SARSA learning, and it always computes the optimal policy at a
particular state [114]. The SARSA learning algorithm should be used by an agent who wants to
explore different strategies quickly and optimize the reward value at the same time. Once the
Q-values converge, we calculate the optimal policy using Eq. (4). It returns the action that
maximizes the Q-value for a given state. The policy is stored within the tool after computation
so that SA can look up which action to take in a particular state. Consider the state:
s: {S1: UNKNOWN, S2: UNKNOWN, S3: PATCHED, S4: VULNERABLE, S5:
PATCHED, S6: UNKNOWN, S7: VULNERABLE}.
Using the Q-learning results (from the simulation of Figure 5.6), the best action is
PATCH-S7 (Electricity Control Center), with a Q-value of 282.2781 for this state. Since S7 is
vulnerable, it is beneficial to protect S7 first; otherwise, SA will receive a reward of -500
if it gets compromised. It is also possible that the algorithm learns to patch a different
function first because leaving it unpatched may lead to the compromise of S7. How the
algorithm learns the optimal policy depends on the importance of the nodes. If we give high
importance to nodes at the start of the graph, the agent will learn a policy that SCANs and
PATCHes those functions first, even if functions at the end of the graph are vulnerable. SA
should therefore choose function importance values carefully to learn an effective optimal
policy. SA should rerun this algorithm at regular intervals to incorporate changes in the
system, such as changes in vulnerabilities, the importance of the functions, the frequency of
attacks on the system, etc. Furthermore, SA learns the optimal policy for a given
number of timesteps. Although more than 16,000 states are theoretically possible, in
reality the system will be in only a few states in a given period. It is therefore efficient to
learn the policy for the given number of timesteps.
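Storing the learned policy for later lookup amounts to keeping, for each visited state, the action with the largest Q-value (Eq. 4). The sketch below is illustrative: the function name `extract_policy` and the tuple encoding of states are hypothetical, and only the PATCH-S7 value of 282.2781 comes from the text; the other two Q-values are made up for the example.

```python
def extract_policy(q):
    """Map each state in the Q-table to its highest-valued action (Eq. 4)."""
    policy, best_value = {}, {}
    for (state, action), value in q.items():
        if state not in best_value or value > best_value[state]:
            best_value[state] = value
            policy[state] = action
    return policy


# Statuses of S1..S7 for the example state discussed above.
state = ("UNKNOWN", "UNKNOWN", "PATCHED", "VULNERABLE",
         "PATCHED", "UNKNOWN", "VULNERABLE")
q_table = {
    (state, "PATCH-S7"): 282.2781,  # value reported in the text
    (state, "PATCH-S4"): 150.0,     # hypothetical
    (state, "SCAN-S1"): 101.5,      # hypothetical
}
policy = extract_policy(q_table)
print(policy[state])
```

Only states that were actually visited during learning appear in the table, which fits the observation that the system occupies few of the theoretically possible states.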
The question arises: since smart meters are edge devices and should be easier to patch
than other system components, why do we include them in the model? The explanation is as
follows. The number of smart meters in a zip code is on the order of 50,000+
for a power utility (depending on the number of customers). Patching all of them is a
huge task. Although a power utility performs patch management remotely, it has to do so
without obstructing the meters’ functionality. Usually, for a power utility, all smart meters
have the same configuration and software version. A utility must patch all of them because, if
even one smart meter is left unpatched, an adversary can use it to enter the system.
Moreover, to compromise smart meters, an adversary performs a worm
propagation attack to exploit a specific vulnerability and install backdoors, as we have seen
in the case of the 2015 Ukraine power grid attack [5]. We have demonstrated this capability by
performing a worm propagation attack in the Attack-Defense approach (in Chapter 3) to control
distributed energy resource units. Also, power engineers may not know whether the meters are
vulnerable, or the installed software version might contain zero-day vulnerabilities. For
these reasons, we have included the smart meter function in the analysis. It is one of the
entry points to the system and can be used to perform cyber-physical attacks such as unsafe
generation (DER compromise) and load manipulation (load drop or curtailment). In the RL-BAGS
model, engineers must assign a higher importance value to smart meters to give them preference
over functions such as meter data management, customer information management system, billing
system, etc.
5.1.6 Conclusion
In this section, we extended the BAGS tool to RL-BAGS with the aim of computing an
optimal resource allocation policy for smart grid security in the presence of cyber attacks.
We implemented two RL algorithms, Q-learning and SARSA learning, on the BN generated by
BAGS to learn optimal policies. The results showed that it is possible to learn the policy
using model-free RL algorithms. The most critical parameter is function importance, which
must be computed carefully. We discussed that Q-learning provides an optimal policy by
exploiting the best possible action instead of following the current exploration policy, as
SARSA learning does. SARSA learning should be used by the attacker to explore all the states
quickly and optimally.
We can learn the optimal policy for the attacker in the same way as for the defender. An
adversary has to try all the options to decide which one is optimal because she does not have
complete knowledge of the system (it is partially observable). Unlike the defender, she does
not know the system functionality until she scans the system. Since the defender is the SA, she
knows the functionality and fundamental importance of the system (without considering the
effect of vulnerabilities). SA should continuously relearn the optimal policy as the
vulnerabilities change, because such changes alter the importance of the functions. We can also
build a recommendation system in which SAs provide recommendations for changing function
importance manually. Power engineers should incorporate this tool into their system to make
optimal decisions at regular intervals. RL-BAGS not only assists in analyzing system resilience
but also provides actions to maintain resilience.
Figure 5.8: The modified IEEE 39-bus power model. This model includes a PV system at bus
30 instead of the generator.
5.2 Reduce the Impact of an Attack in the Physical Domain
In this section, we discuss how integrating a photovoltaic (PV) system within the
power grid could reduce the risk of cyber-physical attacks and improve its resilience. The
generator on bus 30 is replaced by a PV system (Figure 5.8) to analyze the impact of
manipulating circuit breakers (discussed in sub-section 4.4) when PV is integrated with the
system. The circuit breaker is opened after one second of the simulation and closed after the
period T, which is varied from 100ms to 500ms in 100ms steps. Figures 5.9 and 5.10
show the results of attacking the breaker connecting the PV system on bus 30 (250MW).
The attack is clearly mitigated in this case: the frequency stays within the standard
deviation values (59.97 - 60.03 Hz), the rotor angles are stable, and the system maintains
synchronism. The frequency and rotor angle results of the remaining buses exhibit similar
behavior and are omitted from the figures for clarity of presentation.
The impact of the cyber-physical attack is reduced due to the fast responses of the voltage
and current controllers of the PV system after the circuit breaker is closed. The voltage
controller of the PV system tries to restore the voltage magnitude faster than the Automatic
Voltage Regulator (AVR) in synchronous machines. That stabilizes the voltage and supports the
synchronizing torque coefficient in the power system, including the PV system [115].
Moreover, the fast current controller in the PV system helps reduce the power and angle
oscillations of the power system [116]. If there is no impact on the smart grid, then the risk
is reduced to zero even when the probability of a successful cyber attack is high. Even though
the cyber attack may succeed, the overall attack fails because it has no physical impact, and
the resilience of the system is improved. For future work, a more in-depth contingency
analysis can be performed to study the effect on grid resilience of changing the location of
the PV system and increasing the amount of power generated from PV.
Figure 5.9: Each figure demonstrates different duration during which the breaker remained
open. The breaker is opened after 1 second of the simulation. PV is used in this simulation at
bus 30.
Figure 5.10: The rotor angle of the system on four buses when the attack happens. Attack
performed by opening the breaker of generator on bus 30 and closing it after time T. Each
figure demonstrates different duration during which the breaker remained open. The breaker
is opened after 1 second of the simulation. PV is used in this simulation at bus 30.
5.3 Reduce the Impact using Governor in the Cyber Domain
In this section, we present an Intelligent Governor for Cyber-Physical Systems
(IGNORE) to limit the success of attacks when a cyber system has been compromised and
leveraged by an adversary to mount attacks on the physical system. The governor is based on the
principle of a security reference monitor. Its primary purpose is to prevent a cyber attack from
propagating to the physical domain by evaluating commands that are issued by a
compromised higher-level function. We describe the methodology to build the IGNORE system
and demonstrate its usefulness in the smart grid infrastructure by developing a governor for
its DR functionality. The underlying principle for generating the governor’s security policies
is the requirement and safety property. Based on this principle, the governor evaluates whether
commands issued by a cyber system are required in the system as well as safe for the system.
By implementing a governor, we reduce the attack surface of a higher-level
function of a CPS.
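The requirement and safety property can be pictured as two independent checks that a command must both pass before it reaches the physical system. The sketch below is purely illustrative and is not the IGNORE implementation: the classes `CurtailmentCommand` and `GridState`, their fields, and the checks themselves are hypothetical stand-ins for policies that would be derived from the grid's actual DR functionality.

```python
from dataclasses import dataclass


@dataclass
class CurtailmentCommand:
    requested_kw: float  # load the command asks customers to shed (hypothetical field)


@dataclass
class GridState:
    contingency_active: bool    # is there an event that calls for DR at all?
    max_safe_curtail_kw: float  # largest curtailment considered safe (hypothetical)


def is_required(cmd, state):
    """Requirement property: the command is needed given the current state."""
    return state.contingency_active


def is_safe(cmd, state):
    """Safety property: executing the command keeps the system within limits."""
    return 0 <= cmd.requested_kw <= state.max_safe_curtail_kw


def governor_allows(cmd, state):
    """Forward a command only if it is both required and safe."""
    return is_required(cmd, state) and is_safe(cmd, state)


state = GridState(contingency_active=True, max_safe_curtail_kw=500.0)
print(governor_allows(CurtailmentCommand(300.0), state))
```

Keeping the two checks separate means a command from a compromised function is rejected if it fails either one, for example a curtailment signal sent when no contingency exists.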
Our key contribution is to show how to design a governor system for the grid’s
DR functionality, which attacks it prevents, and which factors power engineers should
consider while developing such a system, and to demonstrate its effectiveness through
empirical results. The methodology can be used to develop a governor for a function of other
CPSs. This work sheds light on how a higher-level functionality of a CPS can be protected by
analyzing the system’s cyber and physical aspects even when some parts of the system are
compromised.
5.3.1 Basic Concept
In this section, we first present the fundamental concept of the governor and then
discuss various attack scenarios in different CPSs, explaining where a governor can or cannot
be used and why.
Table 5.2: Attack Scenarios where Governor might be useful.
Access | Feedback Loop Exists | Attack | Governor? (Where?) | Why?
Directly (Cyber System or Mobile App) | YES | 1. Using Botnet to turn on/off IoT devices such as Thermostat, RTUs for pipeline, etc. [49,90,33] | YES (Local edge device) | Such devices change their functionality based on sensor data or state information. For instance, the thermostat does not turn off the heat if the temperature is below X value. A circuit breaker trips if the current is beyond a threshold value. Governor can be used to enforce the security policy.
Directly (Cyber System or Mobile App) | YES | 2. Using a cyber system in Power utility, an adversary controls circuit breakers. [32] | YES (Local edge device) | Same as Attack 1.
Directly (Cyber System or Mobile App) | NO | 3. Using Botnet to turn on/off IoT devices such as Microwave, or other high wattage devices. [49,90] | NO | Such devices do not take actions based on the state information. If someone wants to use Microwave, she turns on Microwave either through physical access or through mobile app remotely. Governor cannot be used to prevent attacks that control Microwave or other devices maliciously.
Directly (Cyber System or Mobile App) | NO | 4. Using a cyber system in Power utility, an adversary turns on/off storage units or DERs. | NO | Same as Attack 3.
Indirectly | YES | 5. Using a cyber system in Power utility or DRAS, an adversary sends malicious signals to control DR functionality [79]. | YES (Where commands are issued) | Functionality is implemented by a cyber component based on the state information. For instance, DRAS sends load curtailment signals to customers to curtail load in response to a contingency. Governor enforces the required and safety property to commands that are issued by DRAS to verify the need for those commands. Similarly, in the case of autonomous cars, governor can be used.
Indirectly | YES | 6. Using a cyber system in Power utility, an adversary sends malicious signals to the AGC [15]. | YES (Local to AGC) | Same as Attack 5.
Indirectly | YES | 7. Using a cyber system in the vehicle infrastructure, an adversary sends malicious signals to V2X Head Unit of a car [88]. | YES (Before V2X head unit of a car) | Same as Attack 5.
The concept of the security reference monitor allows a system to protect its
functionality while operating at the boundary between trusted and untrusted domains [130]. A
reference monitor has the following properties: it cannot be bypassed or altered, and it can be
verified and tested. The governor is based on this concept: it is responsible for evaluating
whether the commands issued by a system's higher-level function are required and safe for the
local system. It makes this decision based on the state information it maintains through
communication with other governors and with the system's components. Table 5.2 lists
various scenarios where a governor is or is not useful, and why.
In Attack 3, an adversary has compromised a botnet of microwaves and network printers
and sends turn on/off commands remotely. A governor cannot prevent this type of attack
because the devices do not involve a feedback loop before starting or stopping their
service. It is difficult for a governor to decide whether commands are legitimate because
they originate from a legitimate but compromised server. For similar reasons, a governor
cannot prevent Attack 4, where an adversary has compromised smart meters
to control DERs and end devices in a home. Since the adversary has compromised a system
through which she directly (and seemingly legitimately) controls edge devices, either through
a cyber system or a mobile app, it is difficult to determine whether the issued commands are
required and safe for the system or the device itself.
In the case of a thermostat (Attack 1), the device decides whether to turn the heat on or
off based on the current temperature. If a customer physically sets a rule on the device not to
turn off the heat when the temperature is below X, and an adversary remotely sends a malicious
command to turn it off, a governor can prevent the attack. A governor placed on the device
containing the temperature sensor enforces the security policy not to turn off the heat if the
temperature is below X. For a similar reason, Attack 2 can be prevented, since a circuit
breaker acts based on the current readings from the system. In such scenarios, we place
a governor near the edge devices and enforce the security policy to prevent attacks that control
their functionality directly.
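The thermostat case can be illustrated with a minimal sketch of an edge governor. The class name `ThermostatGovernor`, the command string `HEAT_OFF`, and the threshold value are hypothetical choices made for illustration only; the thesis does not prescribe an implementation.

```python
from dataclasses import dataclass


@dataclass
class ThermostatGovernor:
    """Hypothetical edge governor for the thermostat case (Attack 1)."""

    min_temp: float  # customer-specified threshold X (assumed unit: deg C)

    def allow(self, command: str, current_temp: float) -> bool:
        # Block a remote HEAT_OFF while the sensed temperature is below X.
        if command == "HEAT_OFF" and current_temp < self.min_temp:
            return False
        # Every other command passes in this simplified sketch.
        return True


governor = ThermostatGovernor(min_temp=18.0)
print(governor.allow("HEAT_OFF", current_temp=15.0))  # False
print(governor.allow("HEAT_OFF", current_temp=21.0))  # True
```

The decision depends only on locally sensed state, which is why the governor can sit on the edge device even when the remote server issuing the command is compromised.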
In some cases, the governor can be used to prevent attacks at the location where
commands originate based on the state information. For instance, in Attacks 5 and 6, the DRAS
in the cyber domain sends load curtailment commands to the customers who have enrolled in
the DR functionality to curtail load in response to a contingency. An adversary compromises
the DRAS server and sends malicious load curtailment commands. A governor placed adjacent
to the DRAS captures these commands and enforces the security policy before allowing
them to go through. It checks whether a load curtailment command is required and safe for
the system based on the state information it maintains. In this scenario, the governor is placed
where commands originate, not near the edge devices. Similarly, we can implement a
governor at a gas or oil pipeline to decide whether commands are required and safe based on
the physical properties of the gas or oil flowing through the pipeline, and thereby avoid
malicious pressure increase or decrease commands.
Finally, in Attack 7, an adversary has compromised the remote car infrastructure and
controls different Electronic Control Units (ECUs) of the car through Wi-Fi or a cellular
network [88]. Since an ECU's functionality depends on the physical properties of a moving car,
we can prevent malicious commands from reaching the end system if we implement a reference
monitor that enforces the security policies.
All these attack scenarios show that it is possible to prevent attacks in which
commands are issued from a compromised cyber domain that is responsible for controlling a
functionality, provided that the end devices have a feedback loop for making decisions. To
understand how to design a governor system for a CPS functionality, we narrow our
focus to protecting the DR functionality of the grid. Before discussing the DR governor, we
discuss which attacks are possible through a compromised DR functionality.
Figure 5.11: An adversary compromises Demand Response Automation Server and a cyber
system in SCADA system to manipulate power demand & supply.
5.3.2 Attack via Demand Response Functionality
In this sub-section, we discuss the threat model and different attack scenarios that
originate from the compromised DR functionality to manipulate power demand and supply.
Demand Response. DR is a smart grid technology in which customers adapt
their electricity consumption from normal consumption patterns in response to contingencies,
to prevent the grid from becoming destabilized. The Power Utility (PU) deploys Demand
Response Automation Servers (DRAS) to perform DR functions in various areas. Each DRAS
is assigned a set of zip codes and receives the power consumption readings from those areas
via aggregate load servers. The PU and DRAS issue DR/AMI commands to perform load
curtailment, send price signals, and control DERs. The PU responds when a contingency
happens in one area and the response needs to be taken in a different area. A DRAS responds
when the contingency happens and the response needs to be taken in the same area. If a DRAS
is not able to mitigate a contingency, it requests the PU to take action; the PU then issues
DR/AMI commands to a DRAS. The DRAS computes the percentage of customers (who are enrolled
in the DR program) to whom commands are forwarded so that the area frequency stays within
the normal range.
Threat Model. An adversary scans the business network endpoints accessible over the
internet to find vulnerabilities. She also performs a social engineering attack via
spear-phishing emails to obtain the credentials of employees. Once she has the credentials, she
impersonates employees and enters the business network. She then compromises systems of
the grid's business network [125] to find which systems are responsible for the DR functionality
and power supply (similar to the Ukraine power grid attack [5]). She gains control over those
systems to manipulate the power demand and supply at a strategic time by leveraging the
information from the ISO website [36]. We assume an adversary has already compromised a
cyber system of the Power Utility (PU) and a DRAS, from where she sends malicious commands
to manipulate the power demand and supply remotely in different areas. Her motive is either to
destabilize the grid or to increase the operational cost of the PU by performing irrelevant load
shedding or curtailment, increasing power supply via DER units, and disconnecting circuit
breakers that connect generators to the grid. There are two categories of attack that are
possible through a compromised DR functionality: Manipulating Power Supply
and Manipulating Power Demand.
Manipulating Power Supply (MPS)
Case 1: Distributed Energy Resource (DER) Manipulation: An adversary has
compromised a DRAS to send malicious commands to X% of smart meters in an area to: a)
disconnect DERs from the grid when they are required; or b) dispatch DER power when it
is not required.
The adversary leverages the power demand and supply information from the ISO
website to decide the timing of this attack to cause maximum damage. By controlling DER
functionality, the adversary controls the power supply to manipulate area frequency.
Case 2: Disconnect Circuit Breaker: An adversary has compromised a cyber system in
a PU to send malicious disconnect commands to X% of breakers with the motive of disconnecting
generators from the grid.
Circuit breakers, controlled by relays, are used to automatically protect the system from
high fault currents. Relays decide whether to open or close a breaker by
processing measurements such as voltage level, power flows, and area frequency. If an
adversary modifies the settings of the relay, the grid becomes susceptible to CPAs. By
disconnecting circuit breakers, the attack disconnects a generator from the grid, thereby
reducing the power generation and supply in an area. This incurs a cost to the PU, which must
exhaust contingency resources to compensate for the reduced generation.
Manipulating Power Demand (MPD)
Case 3: Malicious Load Curtailment: An adversary has compromised a DRAS or cyber
system in the PU to send malicious commands to X% of smart meters in an area to perform
irrelevant load curtailment.
Load curtailment is a DR function that requests customers who are enrolled and
participating in the DR program to reduce their power consumption in response to a contingency.
It is treated as a regulating reserve. The PU has the power supply and consumption information
of all areas, and each DRAS has information about its own area. If there is a contingency such
as low power generation or high power demand, they know how much load to curtail from a
specific area. The DRAS decides, on a rotating basis, the number of customers in its area and
how much load each customer should curtail [104]. This satisfies two objectives: 1) fairness,
since the load is spread across all the customers, and 2) a reduced impact of load curtailment
on each customer. Once decided, it sends load curtailment commands to all customers.
Case 4: Malicious Load Shedding: An adversary has compromised a DRAS or cyber
system in the PU to send malicious commands to X% of smart meters in an area to perform
irrelevant load shedding.
Load shedding is used to reduce the load in response to a contingency where the grid is
not able to provide enough generation to satisfy demand. This functionality is used in extreme
conditions where there is a huge generation loss in an area. As with load curtailment, the PU or
DRAS decides whether there is a need for load shedding based on the system information, and
commands are issued accordingly. The DRAS decides the number of customers in a zip-code to
whom disconnect signals are sent to shed load.
Area Wise Attacks.
We study six variations of the load manipulation attacks, which are based on the
following factors: Contingency, Response, and Attack (CRA). The CRA principle states in which
area the contingency happens, in which area the power utility responds to the contingency, and
in which area the adversary performs the load manipulation attack. Suppose X and Y represent
two areas or zip-codes. Table 5.3 shows six different attack scenarios.
Attack 1: Suppose A% generation loss happens in area X. Since the response is taken
in the same area, it is taken by DRAS X. DRAS X asks customers to curtail B% of the load
to maintain area frequency within secure limits. An adversary leverages the information about
the contingency and issues commands for an extra C% of load curtailment from the compromised
DRAS X.
Table 5.3: Area wise attacks based on Contingency, Response and Attack (CRA) principle.
Suppose X and Y represent two areas or zip-codes as in Figure 5.12.
Attack No. Contingency Area Response Area Attack Area
1 X X X
2 X X Y
3 X X X,Y
4 X Y X
5 X Y Y
6 X X,Y X,Y
Attack 2: Similar to Attack 1. An adversary issues C% load curtailment commands from
compromised DRAS Y in the area Y, which is different from X.
Attack 3: Combination of Attack 1 and 2, where an adversary issues load curtailment
command in both area X (C%) and Y (D%).
Attack 4: Suppose A% generation loss happens in area X. Since the response is taken
in the area Y, it is taken by the PU, which issues commands to DRAS Y. Then, DRAS Y asks
customers to curtail B% of the load to maintain area frequency within secure limits. An
adversary issues C% load curtailment commands from the compromised DRAS X in area X.
Attack 5: Similar to Attack 4. An adversary issues further C% load curtailment
commands from compromised DRAS Y in the area Y.
Attack 6: Combination of Attack 1 and 5, where system responds in both areas X, Y,
and an adversary issues malicious load curtailment command in both area X (C%) and Y (D%).
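The six CRA combinations of Table 5.3 can be written down as a small data structure. The sketch below is only an illustration: the class `CRAScenario` and the helper `pu_involved` are hypothetical names, and the helper encodes the statement from the Demand Response description that the PU acts when a response is needed outside the contingency area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CRAScenario:
    number: int
    contingency: frozenset  # area(s) where the contingency happens
    response: frozenset     # area(s) where the utility responds
    attack: frozenset       # area(s) where the adversary issues commands

    def pu_involved(self) -> bool:
        # The PU must act when some response area differs from the
        # contingency area; otherwise the local DRAS responds alone.
        return bool(self.response - self.contingency)


def areas(*names: str) -> frozenset:
    return frozenset(names)


# Rows of Table 5.3 (areas X and Y).
SCENARIOS = [
    CRAScenario(1, areas("X"), areas("X"), areas("X")),
    CRAScenario(2, areas("X"), areas("X"), areas("Y")),
    CRAScenario(3, areas("X"), areas("X"), areas("X", "Y")),
    CRAScenario(4, areas("X"), areas("Y"), areas("X")),
    CRAScenario(5, areas("X"), areas("Y"), areas("Y")),
    CRAScenario(6, areas("X"), areas("X", "Y"), areas("X", "Y")),
]

for s in SCENARIOS:
    print(s.number, sorted(s.attack), "PU involved:", s.pu_involved())
```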
The primary idea behind the different attack scenarios is that a DRAS provides only local
safety. It checks whether there is any contingency in the area it is responsible for and, if so,
automatically sends commands to mitigate it. It is unaware of the situation in
other areas, and thus it can be leveraged by a malicious entity to perform sophisticated attacks
that are hard to detect and stop. An adversary can choose any of these scenarios to confuse the
defender.
5.3.3 IGNORE
In this section, we present an Intelligent Governor for Cyber-Physical Systems
(IGNORE) to limit the success of attacks when a cyber system has been compromised and
leveraged by an adversary to mount attacks on the physical system. We discuss what a
governor is, how it is designed, which actions it takes, and finally the
methodology for designing a governor for a higher-level function of a CPS.
Governor and its Properties. A governor is a component that serves to protect a CPS from
attacks that are more severe and frequent than is acceptable by enforcing security policies on
the actions of the system's higher-level functions. It acts as a reference monitor that enforces
security policies. The governor receives state information from different components and can
interact with other governors in order to make a decision. Most importantly, it provides local
safety first. The underlying principle for generating security policies for a higher-level function
is the requirement and safety property. The governor must perform sensitivity analysis to
understand when a set of commands needs to be implemented in the system and whether the
system will be in a safe state after implementing those commands. For instance, load shedding
is used as a contingency reserve, and commands must be issued only when there is a generation
loss or a large peak in demand. If there is no contingency in the system, shedding commands are
not required. This forms the requirement policy. For the safety policy, the admin must check
that, after implementing the load shedding command, the area frequency does not go beyond the
lower or upper frequency threshold. The system admin should understand the functionality for
which a governor is designed and then perform sensitivity analysis to derive security policies,
which will be enforced by the governor.
Figure 5.12: Cyber Network for Demand Response. It shows where to deploy Governor to
prevent attacks that are originated from compromised power utility and DRAS servers.
Figure 5.13: Governor Design. IDS: Intrusion Detection System, RTDS: Real time power
simulation tool.
Methodology to Design a Governor. The methodology to design a governor for a
higher-level function of a CPS is as follows:
1. Specify the service that the governor will protect, even if the cyber components that
support this service are compromised.
2. Identify the set of actions issued by the cyber components responsible for the service.
3. Specify the impact of the set of actions on the CPS (sensitivity analysis).
4. Understand and state the scenarios in which those actions are required and safe.
5. Specify how the governor will evaluate the commands and from where it receives
information about the system to perform the evaluation.
6. List the rules that describe the safety and requirement policy for every command.
7. Decide where to place the governor and whether the governor sends information to
other governors.
8. Describe how to protect the governor from cyber attacks and how its security
policy modules are updated remotely so that the system admin can add more functional
modules.
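The core of steps 4 to 6 is that every command passes through a requirement check and a safety check before it is released. The following is a minimal sketch of that structure, not the thesis implementation: the `Governor` class, the dictionary keys, and the two example policies are assumptions introduced for illustration, and the list `ids_alerts` merely stands in for notifying an IDS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Command = Dict[str, float]
State = Dict[str, float]
Policy = Callable[[Command, State], bool]


@dataclass
class Governor:
    required: Policy  # requirement policy: is the command needed?
    safe: Policy      # safety policy: is the resulting state safe?
    ids_alerts: List[Command] = field(default_factory=list)

    def evaluate(self, command: Command, state: State) -> str:
        if self.required(command, state) and self.safe(command, state):
            return "ALLOW"
        self.ids_alerts.append(command)  # stand-in for an IDS notification
        return "DISALLOW"


def shedding_required(command: Command, state: State) -> bool:
    # Hypothetical requirement policy: shed load only during a contingency.
    return state["generation_loss_pct"] > 0


def shedding_safe(command: Command, state: State) -> bool:
    # Hypothetical safety policy: stay below the maximum safe load drop.
    return command["load_drop_pct"] < state["max_safe_drop_pct"]


governor = Governor(required=shedding_required, safe=shedding_safe)
no_contingency = {"generation_loss_pct": 0.0, "max_safe_drop_pct": 30.0}
contingency = {"generation_loss_pct": 20.0, "max_safe_drop_pct": 30.0}
print(governor.evaluate({"load_drop_pct": 25.0}, no_contingency))  # DISALLOW
print(governor.evaluate({"load_drop_pct": 25.0}, contingency))     # ALLOW
```

Keeping the two policies as separate callables mirrors step 6, where safety and requirement rules are listed per command and can be replaced without touching the evaluation loop.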
5.3.4 IGNORE for Demand Response
In this section, we design a governor system, using the methodology described in the
previous section, to prevent attacks on the grid that are caused through manipulation of DR.
Governor Design. The governor system consists of a policy server, a power simulator for
sensitivity analysis, and a database (see Figure 5.13).
First, we need to decide where to place the governor for the DR functionality. Since
commands are issued either by a DRAS or by the PU, we place one governor at each DRAS and
another at the PU (see Figure 5.12). The policy server receives commands from the DR server
through a designated input port. It cannot accept commands from other functionalities. It
interacts with the database, from which it gathers the state information about the local power
system. The state information is collected from the aggregators, which provide the power
demand and supply information, from the parent governor (the PU governor shown in Figure 5.12),
and from the PU, which provides contingency information. All DRAS governors send their
decisions to the PU governor.
The governor is distributed across different levels because commands are issued by
different systems. The primary purpose of the governor is to prevent malicious commands from
reaching end systems. The PU governor blocks commands at a higher level, as it has the state
information of various areas. A DRAS governor is closer to its local area, and it is therefore
important for it to evaluate commands locally. By implementing the governor for DR, we reduce
the attack surface of the DR functionality: we prevent those attacks that originate from
compromised DRAS or PU components. If an adversary can control home devices directly
(such as in [49]), the governor is unable to prevent those attacks.
Second, the policy server verifies commands against two policies: requirement and safety.
The requirement policy checks 1) whether the customers to whom commands are sent have enrolled
in the DR program and 2) whether the set of commands needs to be implemented.
The safety policy checks whether the power system will be in a safe state after implementing
the issued DR commands. The safety policy makes use of a real-time power simulation tool,
RTDS [128], to make decisions. RTDS is used extensively in the industry to understand the
grid's behavior in the presence of contingencies. For the purpose of simulation, we use the
academic version of the Power World simulator to compute the safety condition for a sample grid.
Third, after evaluating the commands, the governor either allows them to go through
or blocks them as malicious. In the latter case, it notifies the Intrusion Detection
System (IDS) about the malicious commands so that power engineers can perform function
containment to prevent further attacks from propagating in the system. How the IDS performs
containment is out of the scope of this thesis, but we present some countermeasures in section
5.3.6. Finally, the PU is responsible for updating the governor software securely. We provide
details of how to protect the governor system in section 5.3.8.
Governor Rules. We present the DRAS and PU governor rules to prevent the attacks
described in section 5.3.2. We state the commands issued by the DRAS and PU and specify
the rules followed by the governor.
Power Utility Governor Rules
PU issues DR/AMI commands either when a contingency happens in one area and the response
is to be taken in a different area, or when a DRAS is not able to maintain power demand and
supply and asks the PU to take action to mitigate the contingency.
Command 1 LOAD_CURTAILMENT (L MW load to curtail to DRAS Y).
Rule 1.1 Required Policy
Rules 1. PU issues this command to the DRAS Y.
2. PU governor verifies whether there is any contingency that led to
reduced generation in some areas or hikes in demand based on the state
information collected in the database.
3. If there is a contingency, ALLOW. It signs the command and forwards
it to the DRAS Y.
4. Otherwise, DISALLOW.
Rule 1.2 Safety Policy
Rules Checked by the DRAS Y governor using safety policy in Rule 5.2 for
curtailment.
Command 2 LOAD_SHEDDING (L MW load to shed to DRAS Y)
Rule 2.1 Required Policy
Rules 1. PU sends this command to the DRAS Y.
2. PU governor verifies whether there is any contingency that led to
reduced generation in some areas or hikes in demand based on the state
information collected in the database.
3. If there is a contingency, ALLOW. It signs the command and forwards
it to the DRAS Y.
4. Otherwise, DISALLOW.
Rule 2.2 Safety Policy
Rules Checked by the DRAS Y governor using safety policy in Rule 6.2 for
shedding.
Command 3 DER_DISPATCH (L MW dispatch from DERs to DRAS Y).
Rule 3.1 Required Policy
Rules 1. PU sends this command to the DRAS Y.
2. PU governor verifies whether there is any contingency that led to
reduced generation in some areas or hikes in demand based on the state
information collected in the database. It knows whether to use DER to
increase generation for some time.
3. If yes, ALLOW. It signs the command and sends it to the DRAS Y.
4. Otherwise, DISALLOW.
Rule 3.2 Safety Policy
Rules Checked by the DRAS governor in the area Y using safety policy in Rule 4.2.
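The PU governor rules above share one pattern: check that a contingency exists, then sign the command so that the DRAS governor can later verify its origin. The sketch below illustrates that pattern under explicit assumptions. The thesis does not name a signature scheme, so an HMAC with a shared demo key is used here purely as a stand-in; the function names, the key, and the command fields are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional

# Assumption: a shared demo key. A real deployment would more likely use
# asymmetric signatures and protected key storage.
PU_GOVERNOR_KEY = b"demo-key-not-for-production"


def _mac(payload: dict) -> str:
    # Canonical JSON encoding so signer and verifier hash the same bytes.
    message = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(PU_GOVERNOR_KEY, message, hashlib.sha256).hexdigest()


def pu_governor(command: dict, contingency_present: bool) -> Optional[dict]:
    """Required policy: sign and forward only if a contingency exists."""
    if not contingency_present:
        return None  # DISALLOW
    return {"payload": command, "sig": _mac(command)}  # ALLOW


def dras_governor_verifies(signed: dict) -> bool:
    """DRAS-side check that the command was signed by the PU governor."""
    return hmac.compare_digest(signed["sig"], _mac(signed["payload"]))


cmd = {"type": "LOAD_SHEDDING", "dras": "Y", "load_mw": 10, "timestamp": 1000}
signed = pu_governor(cmd, contingency_present=True)
print(dras_governor_verifies(signed))                 # True
print(pu_governor(cmd, contingency_present=False))    # None
```

Including a timestamp in the signed payload reflects the rules' reference to verifying the signature "for the recent timestamp"; how freshness is enforced is left open here.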
DRAS Governor Rules
DRAS issues DR/AMI commands when a contingency happens and the response is taken in the
same area. If DRAS is not able to maintain power demand and supply, it asks PU to take
action. DRAS computes, on a rotating basis, the percentage of customers (who are enrolled
in the DR program) to whom commands are sent in order to mitigate the contingency.
Command 4 DER_DISPATCH (X% of customers in area Y).
Rule 4.1 Required Policy
Rules 1. DRAS governor verifies whether the command is issued by the PU. If
PU issued this command, PU governor would have verified the need
of the command and signed it before sending it to the DRAS.
2. DRAS governor will verify the signature of the PU governor and
ALLOW the command. If the command is not signed by the PU
governor, it originated from the DRAS; jump to step 3.
3. DRAS Y governor needs to check whether generation loss has happened
or is about to happen in the area Y. DRAS governor receives the
information about the power supply and demand from local
aggregators and about any contingency in all areas from PU governor.
Therefore, using the state information, DRAS Y governor will verify
whether there is, or is about to be, any generation loss in the area Y.
4. If yes, it will ALLOW dispatching the power to compensate for the
generation loss.
5. Otherwise DISALLOW.
Rule 4.2 Safety Policy
Rules 1. Since the command is required to implement because of generation
loss in the area Y, the governor checks whether it is safe to execute. In
the area Y, the governor has power consumption readings, power
supply readings, and the area frequency from aggregators.
2. It evaluates whether the power supply increase from X% DERs in that
area will put the area frequency in a safe state. Each area has a certain
power supply PS(t) that can be increased for a certain period before
area frequency crosses the over-protection frequency threshold.
3. If the increase in supply is less than PS(t), it is safe: ALLOW;
otherwise DISALLOW.
Command 5 LOAD_CURTAILMENT (X% of customers in area Y, LX Load to
curtail at each customer, L MW load to be curtailed signed by PU).
Rule 5.1 Required Policy
Rules 1. If generation loss happens in some area and load curtailment needs to
be performed in the area Y, the command must be issued by the PU;
jump to step 2. If the generation loss happens in the area Y where
commands are issued, jump to step 3.
2. PU issued signed load curtailment command to the DRAS Y. When
DRAS Y issues commands to all customers, DRAS Y governor
captures these commands, verifies the signature of the PU governor
for the recent timestamp, and verifies the total load that has to be
curtailed using the signed information. If $\sum_x L_x \le L$, ALLOW else
DISALLOW.
3. By leveraging the state information from the database, DRAS Y
governor determines whether there is any contingency in the same area.
If there is no contingency in the same area, discard commands.
4. If there is a contingency in the area Y, each governor has a margin of
Z% (in addition to the generation loss) up to which it will allow load
drop commands to go through. This gives the maximum load that can be
allowed to drop. Let Z = 10%. If a contingency is a 20% generation
loss, the governor allows load drops in the range from 20% to 30%.
Rule 5.2 Safety Policy
Rules 1. Every area has the total load P(t) that can be dropped over time so that
system remains in a safe state (the frequency never crosses the over-
frequency threshold). P(t) is computed based on the power demand and
supply in the area Y, through sensitivity analysis in the Power World
simulator.
2. Compute the total load curtailed as $D(t) = \sum_x L_x$. If D(t) < P(t),
ALLOW else DISALLOW.
Note: P(t) value for a time period is computed by the RTDS in the real
world. In this thesis, we compute P(t) via Power World sensitivity analysis
(described in section 6.2).
Command 6 LOAD_SHEDDING (X% of customers in area Y to shed load, L MW
load to be curtailed and signed by PU governor)
Rule 6.1 Required Policy
Rules 1. If the generation loss happens in some area and load shedding needs to
be performed in the area Y, the command must be issued by the PU;
jump to step 2. If the generation loss happens in the area Y where
commands are issued, jump to step 3.
2. PU issued signed command to the DRAS Y. DRAS Y governor will
verify the signature of the PU governor for the recent timestamp and
verify the total load needs to be shed using the signed information. If
it holds, ALLOW else DISALLOW.
3. If there is no contingency in the same area, discard commands. If there
is a contingency in the area, it will compare the generation loss plus
Z% with the total percentage of load to be dropped.
Rule 6.2 Safety Policy
Rules 1. Governor has the information about the load of X% customers in the
area Y; $D(t) = \sum_x L_x$. Compare the total load shed by
disconnecting X% of customers in the area Y for a given time period with
P(t).
2. If D(t) < P(t), ALLOW else DISALLOW
Command 7 Customer Policy for DRAS Governor
Rules 1. DRAS governor checks whether customers have enrolled in the DR
program.
2. Customers to whom commands are sent should not be hospitals or
critical infrastructure where power supply is required all the time.
3. If customers have not enrolled in the DR program or commands are
sent to critical infrastructure, DISALLOW; otherwise, ALLOW.
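As a worked illustration of Rule 5.1 (step 4) and Rule 5.2, the sketch below combines the Z% margin with the D(t) < P(t) safety check. It rests on several assumptions that are not stated in the rules: percentages are taken relative to the total area load, only the upper end of the allowed range is enforced, and the function names and numeric inputs are hypothetical.

```python
from typing import Sequence


def max_allowed_drop_pct(generation_loss_pct: float,
                         margin_pct: float = 10.0) -> float:
    # Rule 5.1, step 4: generation loss plus the Z% margin.
    return generation_loss_pct + margin_pct


def curtailment_decision(per_customer_kw: Sequence[float],
                         area_load_kw: float,
                         generation_loss_pct: float,
                         p_t_kw: float,
                         margin_pct: float = 10.0) -> str:
    d_t = sum(per_customer_kw)             # D(t) = sum over x of L_x
    drop_pct = 100.0 * d_t / area_load_kw  # assumed: share of the area load
    limit = max_allowed_drop_pct(generation_loss_pct, margin_pct)
    required = generation_loss_pct > 0 and drop_pct <= limit
    safe = d_t < p_t_kw                    # Rule 5.2: D(t) < P(t)
    return "ALLOW" if required and safe else "DISALLOW"


# Z = 10% and a 20% generation loss give a 30% ceiling, as in Rule 5.1.
print(max_allowed_drop_pct(20.0))                             # 30.0
print(curtailment_decision([50.0] * 5, 1000.0, 20.0, 400.0))  # ALLOW
print(curtailment_decision([50.0] * 7, 1000.0, 20.0, 400.0))  # DISALLOW
```

In the second call the requested drop is 25% of the area load, inside the 30% ceiling and below P(t); in the third it is 35%, so the requirement check fails even though the safety check alone would pass.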
Malicious Load Shedding Command Example. Suppose an adversary has already
compromised a DR cyber system in the PU. She sends a malicious load shedding command to
DRAS Y to shed L MW of load from area Y. In the absence of the PU governor, the
command is delivered to DRAS Y. DRAS Y computes the total number of customers to
whom load shedding commands are sent, based on the power consumption of each customer.
Again, in the absence of the DRAS governor, the commands are delivered to all of those
customers. The customers are disconnected from the grid and a load drop happens, which can
make the grid unstable.
Now consider the case when the PU and DRAS governors are present. The command goes
through the PU governor, which evaluates whether the command is required according
to Rule 2.1. The PU governor verifies whether there is any contingency in the system that led
to reduced generation in some areas or whether there is a hike in demand. If the command is
required, the PU governor allows it, decides using the rotating policy to which DRAS to send
it, signs it, and sends it to DRAS Y. DRAS Y then computes the total number of customers in
its area to whom the load shedding command is sent, based on the total load from each customer
(assuming each customer has the same load requirement). The commands are captured by the DRAS
governor, which evaluates for all customers the customer policy (Rule 7), the required policy
(Rule 6.1), and the safety policy (Rule 6.2) simultaneously.
If any policy says NO, the commands are discarded. See Figure 5.14 for the flow chart of the
load shedding command.
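The DRAS-side part of this flow can be sketched as three independent checks whose results are combined. This is a simplified illustration rather than the flow chart of Figure 5.14: the `Customer` class and function names are hypothetical, Rule 6.1 is reduced to "signed by the PU governor or a local contingency exists" (the Z% comparison is omitted), and the numeric values are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Customer:
    enrolled: bool   # enrolled in the DR program
    critical: bool   # hospital or other critical infrastructure
    load_kw: float


def rule_7(customers: Sequence[Customer]) -> bool:
    # Customer policy: all targets enrolled, none critical.
    return all(c.enrolled and not c.critical for c in customers)


def rule_6_1(pu_signature_valid: bool, local_contingency: bool) -> bool:
    # Simplified required policy: PU-signed or a local contingency exists.
    return pu_signature_valid or local_contingency


def rule_6_2(customers: Sequence[Customer], p_t_kw: float) -> bool:
    # Safety policy: total shed load D(t) must stay below P(t).
    return sum(c.load_kw for c in customers) < p_t_kw


def shedding_decision(customers: Sequence[Customer],
                      pu_signature_valid: bool,
                      local_contingency: bool,
                      p_t_kw: float) -> str:
    checks = (
        rule_7(customers),
        rule_6_1(pu_signature_valid, local_contingency),
        rule_6_2(customers, p_t_kw),
    )
    # If any policy says NO, the commands are discarded.
    return "ALLOW" if all(checks) else "DISALLOW"


homes = [Customer(True, False, 0.67) for _ in range(100)]
print(shedding_decision(homes, True, False, p_t_kw=100.0))   # ALLOW
print(shedding_decision(homes, False, False, p_t_kw=100.0))  # DISALLOW
```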
Figure 5.14: Load Shedding Governor Rule flow chart.
Figure 5.15: IEEE 9-Bus System for Governor Experiment.
5.3.5 Experiment Demonstration
In this section, we demonstrate the effectiveness of the IGNORE system for preventing
load altering attacks through compromised DRAS or PU cyber system. Our results are based
on computer simulations. First, we explain the physical system modeling. Second, we explain
the simulation methodology and attack use-cases. Finally, we demonstrate the effectiveness of
the DR governor by providing simulation analysis.
Power Grid Simulation Setup. We use the academic version of the Power World
simulator to perform a transient stability analysis of the grid. We use the IEEE 9-bus system
[126] to represent the power grid because it has frequently been used as a benchmark in the
industry for transient stability analysis. It represents a simple approximation of the Western
System Coordinating Council (WSCC) system with 9 buses and three generators. We divide the
9-bus grid into different areas controlled by DRAS, as shown in Figure 5.15. DRAS 1 is
responsible for zip-code 1 (see Figure 5.12), which is modeled as the load on bus 5. DRAS 2 is
responsible for zip-code 2, which is modeled as the load on bus 8. Finally, DRAS 3 is
responsible for zip-code 3, which is modeled as the load on bus 6. Bus 1 is a slack bus that
provides the base power supply to the system. The generators on buses 2 and 3 are responsible
for providing power to the specific zip-codes in which they are present and to the zip-codes
modeled as the load on bus 8.
We model zip-codes using zip code 90057, which is a highly populated neighborhood in
central Los Angeles [127]. We know the power consumption data of zip code 90057 for
industrial, commercial, and residential customers. The values of percentages and average load
in kilowatts (kW) were derived from average numbers for the entire Los Angeles Department
of Water and Power (LADWP) service area, as reported to the US Energy Information
Administration.
Table 5.4: Zip Code-90057 Los Angeles Neighborhood model for smart meters and load
distribution [127].
90057 Neighborhood Energy Customer Model
Customer Type Percent (%) Number of Meters Average Load (kW)
Industrial (I) 0.50 2 19.17
Commercial (C) 12.20 49 8.51
Residential (R) 87.30 349 0.67
Total 100 400 689.16
Table 5.5: Approximate Number of Houses, Commercial and Industrial Customers in each
zip-code that are responsible for the load on buses in the 9-Bus System.
DRAS   Load Bus   ZipCode   #NH   I     C      R
1      5          1         181   362   8869   63169
2      8          2         145   290   7105   50605
3      6          3         130   260   6370   45370
Figure 5.16. P(t): Maximum load that can be dropped. This is the result of sensitivity
analysis.
We use the neighborhood data in Table 5.4 to model all zip-codes. Consider bus 5 of the
IEEE 9-bus system, which has a load of 125 MW and is modeled as zip-code 1. According to
Table 5.4, one neighborhood has an average load of 689.16 kW. Therefore, zip-code 1 contains
approximately 181 neighborhoods (125 MW divided by 689.16 kW). Table 5.5 lists the number of
neighborhoods (#NH) modeled on each load bus. Each meter is modeled as a residential,
commercial, or industrial customer in a neighborhood. Since one neighborhood contains 349
houses (Table 5.4), 181 neighborhoods contain 63,169 houses. The number of houses,
commercial customers, and industrial customers in each DRAS region is shown in Table 5.5. It
shows that the load on bus 5 (zip-code 1) is produced by 181 neighborhoods, which contain
63,169 residential, 8,869 commercial, and 362 industrial customers.
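The arithmetic behind Tables 5.4 and 5.5 can be reproduced with a short script. This is only an illustrative sketch: the function names are hypothetical, the 100 MW load for bus 8 is inferred from the later statement that 40% of its load is 40 MW, and rounding the neighborhood count down is an assumption that happens to match the table.

```python
# Per-neighborhood customer model from Table 5.4:
# customer type -> (number of meters, average load per meter in kW).
NEIGHBORHOOD = {
    "industrial": (2, 19.17),
    "commercial": (49, 8.51),
    "residential": (349, 0.67),
}


def neighborhood_load_kw(model=NEIGHBORHOOD):
    """Average load of one whole neighborhood in kW."""
    return sum(meters * avg_kw for meters, avg_kw in model.values())


def customers_on_bus(bus_load_mw, model=NEIGHBORHOOD):
    """Return (#NH, customer counts per type) for a bus load given in MW.

    Assumption: the number of neighborhoods is rounded down.
    """
    n_neighborhoods = int(bus_load_mw * 1000 // neighborhood_load_kw(model))
    counts = {kind: n_neighborhoods * meters for kind, (meters, _) in model.items()}
    return n_neighborhoods, counts


bus5 = customers_on_bus(125)  # bus 5, zip-code 1
bus8 = customers_on_bus(100)  # bus 8, zip-code 2 (load inferred, see above)
print(round(neighborhood_load_kw(), 2), bus5, bus8)
```

Under these assumptions the script yields 181 neighborhoods for bus 5 and 145 for bus 8, in agreement with Table 5.5.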
Safety Policy for Load Shedding and Curtailment. To enforce the safety policy on load
shedding or curtailment commands, the governor compares the percentage of the load to be
curtailed or shed (that is, the load drop) with the maximum percentage of the total load that
can be dropped before the system destabilizes. Figure 5.16 shows the boundary of power quality
violation: if the percentage of load dropped over a certain duration goes beyond the boundary,
the area is no longer in a safe region. We use this boundary as the safety policy. The graph
changes over time based on power demand and power supply, so the governor must compute and
update the P(t) graph based on the state information. In a real-world power system, RTDS must
be used to perform the sensitivity analysis, since it is a fast real-time power simulation
tool.
Simulation Methodology. The total demand in the IEEE 9-bus model is 315 MW,
which is satisfied by three generators supplying 320 MW. For experimental purposes, we express
generation loss and load drop as percentages of these values. Suppose a 70% generation loss
happens in area X and the utility performs load shedding to shed 40% of the load in the same
area. To model this scenario, we first reduce the generation of Gen 2 by 70% and then reduce
the load on Bus 5 by 40%. We then perform transient stability analysis and plot the frequency
to understand whether the system is stable. We demonstrate the stability of the system in the
presence of a load shedding attack in two cases: 1) the governor is not present, and 2) the
governor is present. We run every simulation for 300 seconds and plot the area frequency curve.
Base Case. In order to show how the frequency of the system behaves in the presence
of a contingency as a base case, we simulate the generation loss and load shedding performed
by the PU in the absence of the governor and attacks. The initial generation from the generators
in the WSCC model at Bus 10 and Bus 11 is 163 MW and 83 MW respectively.
Suppose that during a hot summer day the power demand is at its peak and the system must
satisfy the demand by providing sufficient generation. If there is a sudden increase in load,
more power flows through the lines, which may result in line overloads and, ultimately, line
tripping. This is because power flow is determined by Kirchhoff’s laws, which do not take line
capacity constraints into account. Once lines start tripping, the lines connecting the
generators to the grid will trip, and thus generation loss happens. To model this scenario, we
reduce the generation on Bus 10 by 20% at 50 s and by 50% at 55 s, and on Bus 11 by 45% at
60 s. After some time, we restore the generation of Bus 10 by 45% at 200 s and of Bus 11 by 40%
at 210 s. Figure 5.17 shows how the frequency deviates from its normal state (60 Hz), moves to
a contingent state, and finally drops below the under-frequency threshold of 57.60 Hz.
In the above case, the PU does not perform load shedding to bring the frequency back to
its normal state. Let us see what happens when the utility performs load shedding in the above
simulation. Since the PU knows about the generation loss, it sends a load shedding command to
DRAS 1 and 2 to reduce the load by 40% in each area. DRAS 1 and 2 receive the load shedding
command and compute the number of meters to which the command must be sent using Tables 5.4
and 5.5. DRAS 1 is responsible for controlling the load at Bus 5, which is 125 MW, so 40% of
the load means 50 MW. Similarly, DRAS 2 is responsible for Bus 8 and must shed 40 MW of load.
Suppose DRAS 1 sheds 30 MW of load from residential meters and 20 MW from commercial meters.
DRAS 1 then sends load shedding commands to 70.88% of residential meters and 26.49% of
commercial meters. Similarly, DRAS 2 sheds 30 MW of load from residential meters and 10 MW
from commercial meters. Therefore, DRAS 2 sends commands to 88.48% of residential meters and
16.53% of commercial meters.
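The meter percentages above follow from Tables 5.4 and 5.5. The sketch below shows one way to derive them; the function name is hypothetical, and it assumes that each commanded meter drops its full average load.

```python
# Average load per meter in kW (Table 5.4) and customer counts per DRAS (Table 5.5).
AVG_LOAD_KW = {"residential": 0.67, "commercial": 8.51}
CUSTOMERS = {
    "DRAS 1": {"residential": 63169, "commercial": 8869},
    "DRAS 2": {"residential": 50605, "commercial": 7105},
}


def meter_share_percent(dras, customer_type, shed_mw):
    """Percentage of meters of one type that must be commanded to shed shed_mw.

    Assumption: a commanded meter drops its entire average load.
    """
    available_kw = CUSTOMERS[dras][customer_type] * AVG_LOAD_KW[customer_type]
    return 100.0 * shed_mw * 1000.0 / available_kw


for dras, kind, mw in [
    ("DRAS 1", "residential", 30),
    ("DRAS 1", "commercial", 20),
    ("DRAS 2", "residential", 30),
    ("DRAS 2", "commercial", 10),
]:
    print(f"{dras} {kind}: {meter_share_percent(dras, kind, mw):.2f}%")
```

The computed shares agree with the percentages quoted in the text to within about 0.01 percentage points; the small differences in the commercial figures come from rounding versus truncation.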
Figure 5.18 presents the frequency curve over time for the load shedding scenario
during generation loss. The load shedding signals from DRAS 1 arrive at 55 s, which is why
we see an increase in frequency. The frequency then drops due to the loss of generation on
Bus 11. At 65 s, the load shedding commands from DRAS 2 lead to an increase in frequency and
prevent it from dropping further. After a certain duration, when generation is restored, the
frequency goes back to normal. Load shedding thus helps the power system maintain its
frequency within safe limits, preventing lines and generators from tripping.
Figure 5.17: Base Case - Generation Loss.
Figure 5.18: Base Case - Generation Loss and Load Shedding.
Figure 5.19: Load Shedding Attack - Within safe frequency limits.
Figure 5.20: Load Shedding Attack - Beyond safe frequency limits of 61.8 Hz
Figure 5.21: Load Shedding Attack - With Governor
Load Shedding Attack – Without Governor. In this scenario, we demonstrate how
an adversary performs a load shedding attack either to destabilize the system or to increase
the cost to the utility in terms of customer dissatisfaction.
Suppose that in the DRAS 1 area a 20% generation loss (32.6 MW) happens and the PU sends
load shedding signals to reduce power consumption on load Bus 5 by 30%. An adversary
leverages generation loss information from the PU website and performs irrelevant load
shedding. In the first case, the adversary sends load shedding signals from compromised DRAS
1, 2, and 3 to 70.88% of residential meters and 43.06% of commercial meters in DRAS 1, 44.41%
in DRAS 2, and 58.98% in DRAS 3 to reduce 20% (additional load), 15%, and 20% of the load,
respectively. Figure 5.19 shows that when the generation loss happens at 85 s, the area
frequency drops. The relevant and irrelevant load shedding signals then push the frequency
above its normal value, but it stays within the over-frequency threshold of 61.8 Hz. In this
case, customers were forced to shed load when it was not required. They will be dissatisfied
with this service and start sending requests regarding the power outage.
Similarly, in the second case, an adversary sends irrelevant load shedding commands
from compromised DRAS 1, 2, and 3 to reduce 20% (additional load), 45%, and 60% of the load,
respectively, with the main motive of destabilizing the system. DRAS 1 recovers load from
70.88% of residential meters and 43.06% of commercial meters in its area. DRAS 2 recovers
30 MW from 98.96% of residential meters and 10.5 MW from 19.36% of commercial meters. DRAS 3
recovers 30 MW from 88.48% of residential meters and 30 MW from 49.61% of commercial meters.
Figure 5.20 shows that when the generation loss happens at 85 s, the area frequency drops. The
shedding commands then push the frequency beyond the over-frequency threshold of 61.8 Hz at
110 s. Once the area frequency goes beyond the secure threshold, generators will trip and
generation loss happens.
Load Shedding Attack – With Governor. Suppose that in the DRAS 1 area a 20%
generation loss (32.6 MW) happens and DRAS 1 sends load shedding signals to reduce the load
on Bus 5 by 30%. Since the command did not originate from the PU, it is possible that the
generation loss happened in the same area where the command is issued. An adversary
leverages the generation loss information from the PU website [36] and, from compromised
DRAS 1, 2, and 3, sends load shedding commands on Bus 5, Bus 8, and Bus 6 to reduce the load
by 20%, 15%, and 20%, respectively. The main purpose of the attacker is to disconnect
customers from the grid and increase the cost to the utility in terms of customer
dissatisfaction and customer reconnection. The DRAS 1, 2, and 3 governors provide local safety
by verifying the safety and requirement policies described by Rule 6.
DRAS 1 requirement policy: the governor checks whether there is any contingency in
area 1; if so, the total load shed should be in the range of 20% (the total generation loss)
to 30%, according to Rule 6.1. In this case, load drop commands totaling 50% are issued on
Bus 5 in response to the 20% generation loss in area 1; the governor allows only a 30% load
drop, passes commands accordingly, and discards the rest. In the simulation, as admin, we issue
30% load drop commands, but in reality, the admin will issue commands that she thinks is the
incorrect response for the system, and she will adjust the Z value accordingly. DRAS 1 safety
policy: the governor checks whether the total load to be dropped is within the limits of the
safety policy shown in Figure 5.16. Since the 30% load drop is below the threshold, the
governor allows it. DRAS 2 and 3: since there is no generation loss in those areas and the
commands did not originate from the PU, the DRAS 2 and 3 governors discard those commands.
Figure 5.21 presents the area frequency curve where the attack commands are blocked by the
governors: the relevant commands are allowed by the DRAS 1 governor, and the irrelevant 15%
and 20% load drops are blocked by the DRAS 2 and 3 governors, respectively.
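The local decision described above can be sketched as a small function. This is a hedged illustration, not the rule text of Rule 6, which is defined elsewhere in the thesis. It assumes that the requirement bound equals the generation loss plus a margin Z (here Z = 10%, so that 20% + Z = 30%), it uses a hypothetical safe maximum of 40% for P(t) because Figure 5.16 is not reproduced numerically, and it does not model PU-originated commands or signature checks.

```python
def governor_allowed_drop(requested_pct, gen_loss_pct, z_pct, p_max_pct):
    """Return the percentage of load drop a DRAS governor lets through.

    requested_pct: total load drop requested by the commands in this area
    gen_loss_pct:  generation loss known for this area (0 if no contingency)
    z_pct:         assumed margin above the generation loss (the Z value)
    p_max_pct:     assumed maximum safe load drop read from the P(t) curve
    """
    # Requirement policy: with no local contingency (and no PU-originated
    # command, which this sketch does not model) every command is discarded.
    if gen_loss_pct <= 0:
        return 0.0
    # Requirement policy: cap the drop at generation loss plus margin.
    required_cap = gen_loss_pct + z_pct
    # Safety policy: never exceed the maximum safe load drop.
    return float(min(requested_pct, required_cap, p_max_pct))


# Scenario from the text: 50% requested in area 1 (20% generation loss),
# 15% and 20% requested in areas 2 and 3 (no generation loss).
decisions = {
    "DRAS 1": governor_allowed_drop(50, 20, 10, 40),
    "DRAS 2": governor_allowed_drop(15, 0, 10, 40),
    "DRAS 3": governor_allowed_drop(20, 0, 10, 40),
}
print(decisions)
```

Under these assumptions the DRAS 1 governor passes a 30% drop and the DRAS 2 and 3 governors pass nothing, matching the outcome described for Figure 5.21.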
If the commands had originated from the PU, it would mean that, although load shedding
is happening in area 1, the generation loss happened in another area. The PU issues the
command to DRAS 1. The PU governor verifies and signs the command using Rule 2.1. The
commands reach DRAS 1, which computes the total number of meters in its area to which the
command must be sent. The DRAS 1 governor needs to verify whether these commands are required.
It verifies the signature of the PU governor and the total load that needs to be shed using
Rule 6.1. If the commands are valid, it verifies whether they are safe to execute using the
safety property in Rule 6.2, that is, whether the total load to be shed is below the maximum
load that can be shed in the given time. If so, it allows the commands (see Figure 5.19,
assuming the commands were not attack commands). If the governor is not present, the grid
destabilizes as in Figure 5.20; otherwise, the result is similar to Figure 5.21.
5.3.6 Analysis of Special Cases
In this section, we show how the governor prevents the area-wise attacks discussed in
section 5.3.1, and how to upgrade the governor rules for two special cases: 1) a critical
infrastructure in an area requires power supply irrespective of whether other areas are
stable, and 2) a load shedding attack in which the attacker reduces the load shedding command
to cause overloading instead of performing irrelevant shedding.
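Area wise attacks. We assume that a generation loss contingency (GLC) occurs and that
the governor prevents the load shedding attack (LSA) in all cases. Each case is labeled Attack
Number (C;R;A), where C is the area in which the contingency occurs, R is the area in which the
system responds, and A is the area in which the attack is performed.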
Attack 1 (X;X;X): An adversary performs LSA in the same area X where the GLC happens,
and the DRAS X system responds to mitigate the contingency. The DRAS X governor verifies the
command issued by DRAS X according to Rule 6, since it knows the state information (including
the GLC), and confirms whether the command is required and safe for the local system X.
Attack 2 (X;X;Y): The GLC happens in area X and the system responds in the same area, but
an adversary performs LSA in area Y. If a compromised DRAS Y issues LSA commands, they will
be detected by the DRAS Y governor, since there is no contingency in area Y; it will therefore
disallow the commands and notify the IDS. If the PU issues the command to DRAS Y, the PU
governor will detect this malicious command, since it knows that the DRAS X governor has
already taken action in X and that no further action is needed. If DRAS X cannot mitigate the
GLC, it asks the PU to take a specific action, and that request goes through the DRAS X
governor, which verifies the request and forwards it to the PU and the PU governor. The PU
command is then checked according to Rule 2.
Attack 3 (X;X;X,Y): The GLC happens in area X and the system responds in the same area,
but an adversary performs LSA in both areas X and Y. When the adversary issues LSA commands
in X, they will be detected by the DRAS X governor according to Rule 6. In Y, LSA commands
are verified in the same way as in Attack 2.
Attack 4 (X;Y;X): The GLC happens in area X and the system responds in a different area
Y. This is the case when DRAS X requests the PU to perform load shedding because it cannot
mitigate the GLC. DRAS X issues the request, and it goes through the DRAS X governor, which
verifies it and sends it to the PU and the PU governor for the state information update. The
PU issues commands to DRAS Y, and they are verified by the PU governor, which knows that the
contingency is to be mitigated by issuing commands in area Y. However, the LSA happens in X.
When DRAS X issues a malicious command, it is captured by the DRAS X governor, which discards
it because it knows that DRAS X has issued a request to the PU.
One limitation of the governor is that if the attacker performs the LSA first, after the
GLC but before DRAS X issues its command to the PU, the DRAS X governor will allow the
commands if they are safe for the system according to Rule 6. In any case, the governor will
not allow commands that destabilize the system. The PU must anticipate future generation loss
and send state information to the DRASs so that they can take action quickly.
Attack 5 (X;Y;Y): The GLC happens in area X and the system responds in area Y, where the
LSA happens. The PU issues commands to DRAS Y at the request of DRAS X, and the PU governor
verifies them. The DRAS Y governor confirms whether the commands (both malicious and
legitimate) are required, based on the signature of the PU, and safe for the local system.
Attack 6 (X;X,Y;X,Y): Suppose a 20% GLC occurs in X, the system responds in X and Y, and
an adversary performs LSA in X and Y. DRAS X decides to perform 20% LS in X and requests the
PU to perform 10% LS in Y. In X, if an adversary adds Z% of extra LS commands, they will be
verified by the DRAS X governor. Moreover, it will check the request that is sent to the PU
and the PU governor. In Y, the PU issues the command to DRAS Y through the PU governor, which
verifies the percentage of the request based on its state information. If the attacker
performs the attack through DRAS Y, it will be verified by the DRAS Y governor, since it knows
the exact percentage of load that needs to be shed based on the state information received
from the PU governor.
Critical Infrastructure Scenario. Suppose a generation loss happens in area X, where a
critical infrastructure such as a hospital is present, and DRAS X cannot supply enough power
to the hospital by shedding load in area X alone. It therefore requests the PU to perform
shedding in one of the adjacent areas, because power must be supplied to the hospital in all
scenarios. We assume that the hospital has already used its backup power storage and that
DRAS X does not have enough storage to satisfy the hospital's demand. The PU performs load
shedding in another area Y. If the percentage of the load to be shed is within the threshold
limits, the governor allows the commands to go through according to Rule 2 and Rule 6. But if
it exceeds the threshold limit, the governor does not allow them, so we need to update the
rule for this scenario. In such cases, area Y is disconnected from the rest of the grid and
load shedding is performed there, so that over-frequency changes do not cascade to the rest of
the grid. Thus, it is possible to shed load and supply power to the hospital.
Reduce Load Shedding Attack. Suppose that in the DRAS 1 area a 20% generation loss
(32.6 MW) happens and, to keep the frequency above the under-frequency threshold, DRAS 1
must send load shedding signals to reduce the load on Bus 5 by 30%. Since we have assumed
that DRAS 1 has been compromised, an adversary reduces the percentage of load shedding to
be performed in the region to cause overloading. With the current governor rules, the governor
cannot detect this attack and allows the commands to go through, so we need to update the rule
for this scenario. For a given percentage of contingency, in the real world, RTDS will simulate
the minimum percentage of load L(t) that must be shed so that the frequency stays above the
under-frequency threshold. In the safety rule, the governor checks whether the percentage of
the load to be dropped lies between the lower bound L(t) and the upper bound P(t). If so, it
allows the commands; otherwise, it notifies the IDS. Similarly, we can update the rule for the
load curtailment command.
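The updated safety rule amounts to a two-sided bound check. The following minimal sketch uses a hypothetical function name and return labels; the lower bound of 30% is taken from the scenario above, while the upper bound of 40% is an assumed value standing in for P(t).

```python
def check_load_drop(drop_pct, lower_pct, upper_pct):
    """Classify a load drop against the lower bound L(t) and upper bound P(t).

    Returns "allow" if lower_pct <= drop_pct <= upper_pct, else "notify_ids".
    """
    if lower_pct > upper_pct:
        raise ValueError("lower bound L(t) must not exceed upper bound P(t)")
    return "allow" if lower_pct <= drop_pct <= upper_pct else "notify_ids"


# L(t) = 30% from the scenario above; P(t) = 40% is an assumed value.
print(check_load_drop(30, 30, 40))  # legitimate shedding
print(check_load_drop(10, 30, 40))  # reduced load shedding attack
print(check_load_drop(50, 30, 40))  # excessive shedding
```

With both bounds in place, a command that sheds too little is flagged in the same way as one that sheds too much.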
5.3.7 Countermeasures
When the governor detects malicious or unwanted commands, it notifies the IDS so that
countermeasures can be taken to prevent further attacks. First, the affected area should be
disconnected from the rest of the grid so that the attack does not propagate to neighboring
regions. Second, a vulnerability assessment of the cyber system of that region should be
performed to understand its vulnerability status. Finally, cyber systems should be prioritized
for patch management. Simultaneously, resources such as storage and renewable resources should
be deployed to bring the frequency of the area back to its nominal value.
5.3.8 Limitations
This study has some limitations in terms of the attacks that the governor does not
prevent and the potential new threats it adds to the grid.
Attacks not Prevented. The governor does not prevent DDoS attacks, false data injection
attacks, or privacy attacks. An adversary can perform a DDoS attack on the communication
infrastructure to prevent relevant load shedding commands from reaching customers, causing
the frequency to drop below the under-frequency threshold. In the case of false data injection
attacks, if attackers modify commands between the DRAS governor and customers, the governor
will not detect those changes, since the commands have already left the governor. If commands
are modified anywhere between the PU and the DRAS or between the DRAS and the DRAS governor,
the governor will detect it.
Potential threats Added. New threats arise when an adversary compromises a governor.
There will then be no point of verification, and all malicious commands will be executed
in the system. A compromised governor can be used to make malicious requests to the PU.
Moreover, a malicious governor can send incorrect information to the PU governor, which will
use this information in making further decisions for different areas. Therefore, we need to
protect the governor from being compromised, and we describe some measures in section 5.3.9.
General Limitations. We used the WSCC IEEE 9-bus model to demonstrate attacks
on the power system, assuming attackers have already compromised the cyber network. The
magnitude of the presented attacks may not reflect real-world scenarios, but it illustrates how
attacks can be conducted on the physical system. Power engineers must simulate such scenarios
with and without the governor, considering the details of both the cyber and power systems.
Moreover, we do not consider power storage, such as batteries used by the utility to satisfy
demand in the case of generation loss in some areas.
5.3.9 Governor Protection
The question arises: what if a governor is compromised? As one would do with a
security kernel in a high-assurance system, the governor should be implemented with economy
of mechanism to provide a smaller attack surface than exists for other modules in a
system. Additionally, trusted computing technologies may be leveraged to provide stronger
assurance of the integrity of the governor itself. If the total code used to implement a
governor is minimized, it may also lend itself to the application of more formal code analysis
techniques.
The set of rules that a governor implements must be updated either by physically
installing it on the device or by connecting to the device over a VPN. It should be updated at
regular intervals to avoid any vulnerabilities. In no case is a governor allowed to communicate
with any other device. The system that controls the governors must be separate from the DRAS
and PU so that, if they are compromised, the governor system can still perform its functions
independently.
The governor must have designated ports to accept state information from aggregators,
power units, and the PU governor. It does not communicate on any other port and discards all
unknown connections. In case of any contingency, it will notify the IDS about the irrelevant
commands. To provide scalability and fault tolerance, multiple replica servers of the governor
system should be deployed. Scalability is essential for verifying a large number of
commands at the DRAS governor. How to update software across multiple replicas is
beyond the scope of this thesis.
5.4 Conclusion
We presented a methodology for designing a governor that prevents attacks on a
higher-level function of a CPS, and empirically demonstrated the effectiveness of the approach
by designing the governor system for the DR functionality. The DR governor prevents attacks
that originate from a compromised DRAS or PU and try to manipulate power demand and
supply. The governor verifies the requirement and safety policies of all commands issued by
the PU and DRAS. We showed how the DR governor prevents attacks that destabilize the grid or
increase the cost to the admin in terms of customer dissatisfaction, as well as attacks that
target circuit breakers and DER units. With a governor, it is possible to prevent attacks in
which commands are issued from a compromised cyber domain that is responsible for
controlling a functionality and in which end devices are not directly controlled. By
implementing a governor, we reduce the attack surface of a function.
We discussed where to place the governor, from where it receives the state information,
how it computes the safety policy, and how rules are applied. We recommend that power
engineers carefully choose the Z value based on what they learn about the system over time, so
that the governor allows at least the relevant percentage of commands to go through and the
system stays in a stable state. We hope our work educates power engineers about how to prevent
a cyber attack from propagating to the physical domain (the containment stage of resilience) to
make it secure and resilient against CPAs. This work sheds light on how higher-level
functions can be protected by analyzing the cyber and physical aspects of the system.
Chapter 6
Power Storage Protection Framework
In this chapter, we present a framework that assists power engineers in deciding which
actions they should take to recover from CPAs at minimum operating cost, such as decreasing
or increasing power reserve and power dispatch, performing load curtailment or shedding, or
repairing physical nodes or links. This chapter focuses on the recovery phase of resilience.
Our contribution is twofold. First, we describe the different types of attacks on energy
storage units and their impact on smart grid resilience. Second, we discuss the formulation of
the Power Storage Protection (PSP) framework against a fixed opponent (adversary). We fix
the strategy of the adversary, model the problem as a Partially Observable Markov Decision
Process (POMDP) from the perspective of the defender (power utility) [107], and solve it with
the incremental pruning method [106] using a POMDP solver [105]. The model decides what actions
to perform in order to minimize the cost of operation and maintain power system stability
against the actions of an adversary. We provide a theoretical framework for formulating the
above problem and demonstrate its effectiveness through empirical results using a simplified
PSP scenario.
6.1 Attacks on Power Storage
In this section, we describe possible cyber attacks on the energy storage systems and
their impact on the smart grid resilience.
Batteries to store power. Power utilities pay power vendors to reserve power in
batteries. These batteries are connected to the transmission and distribution infrastructure.
An adversary can perform topology and DoS attacks on the communication infrastructure that is
responsible for sending power dispatch messages. Furthermore, the adversary can compromise
communication protocols such as SNMP or Modbus [108] to modify the data flowing through
the network between the power utility and the batteries. An attack on batteries prevents power
dispatch during a contingency (a power shortage relative to demand) and thus destabilizes the
grid.
Distributed Energy Resource (DER). DERs are smaller power sources that generate,
store, and dispatch power once signaled by the utility. DERs fulfill the need for power during
a contingency. Since a DER is located at the consumer's facility, consumers can produce and
sell power to the grid. This requires frequent communication between the DER at the customer's
facility and grid operators. Since a DER is accessible over the network, it is vulnerable to
CPAs. If an adversary controls a set of DERs, she can disable them to prevent power dispatch,
creating a demand-supply mismatch and thus degrading the grid’s resilience. She can perform a
variety of attacks on the DER/AMI infrastructure, such as a malware propagation attack that
compromises smart meters to control DERs, or a DDoS attack on the communication infrastructure
that prevents power dispatch signals from reaching the DERs.
Natural Gas Storage for Peaker Plant. A peaker plant generates power during peak hours
to meet peak demand. Natural gas is delivered to the peaker plant via low-pressure distribution
pipelines. If an adversary attacks the natural gas pipeline [29], gas delivery to the peaker
plant is affected, resulting in loss of power generation during peak hours and causing a power
demand-supply mismatch.
6.2 Power Storage Protection (PSP) Framework
PSP is an infinite horizon two-player zero-sum Partially Observable Stochastic Game
(zs-POSG) with one-sided, partial observability. We fix the attack strategy of the attacker and
model the problem as a POMDP [105] from the perspective of the defender. The main idea
behind this approach is that the defender does not perform any action if there is no contingency.
If the system is working fine, the defender performs NO ACTION.
The model is partially observable because the defender does not know the true
state of the system; she does not know which attacks have happened. The game is zero-sum
because the adversary performs attacks that destabilize the power grid and maximize the
defender's cost, while the defender wants to decide, at every time step, which action
stabilizes the grid and minimizes the cost. For each attacker action, the defender performs an
action that incurs some cost.
The PSP POMDP model is defined by the tuple $(I, S, B, A, T, R, b_0, \Omega, O, \pi, \gamma)$.
Nomenclature:
• $I = \{I_d, I_a\}$ is the set of agents.
• $S$ is the finite set of world states.
• $B: \Delta(S)$ is the probability distribution over $S$.
• $A$ is the finite set of actions $(A_d, A_a)$ of the agents.
• $b_0 \in \Delta(S)$ is the initial belief of the game.
• $\Omega$ is the finite set of observations.
• $O$ is the conditional observation probabilities.
• $\pi$ is the fixed stochastic strategy of the attacker.
• $T: S \times A \to S$ is the state transition function.
• $R: S \times A \to \mathbb{R}$ is the reward function for agent $I_d$.
• $\gamma \in [0,1]$ is a discount factor.
Suppose $s \in S$ is the current state of the system at time step $t$. In $s$, the defender
takes action $a_i \in A_d$, and the attacker takes action $a_j \in A_a$ according to the fixed
stochastic policy $\pi_j = p(a_j \mid s, a_i)$. The game moves to a new state $s' \in S$
according to a stochastic joint transition model $p(s' \mid s, a_i, a_j)$ in time step $t+1$.
Since we know the (fixed) policy of the attacker, we compute a single transition model
$T(s' \mid s, a_i)$ for the defender:
$p(s' \mid s, a_i) = \sum_{a_j} p(s' \mid s, a_i, a_j)\, p(a_j \mid s, a_i)$ (1)
Using this equation, we model the game as a POMDP from the perspective of the
defender [107]. The defender receives an observation $o \in \Omega$ with probability
$O(o \mid s', a_i) = p(o \mid s', a_i)$ as the game moves to a new state $s'$. Since the
strategy of the attacker is fixed, we do not care about the attacker’s observation. The
defender receives a reward (cost) of $-R_t(s, a_i)$ for this transition and the attacker
receives $R_t(s, a_i)$ because this is a zero-sum game. The reward may depend on the previous
state of the game and the joint action (eq. 2). It is also possible that rewards depend only
on the state of the system (eq. 3).
$R_t(s, a_i) = \sum_{a_j} r(s, a_i, a_j)\, p(a_j \mid s, a_i)$ (2)
$R_t(s, a_i) = r(s)$ (3)
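Equations 1 and 2 both average the joint model over the attacker's fixed policy. The sketch below illustrates that marginalization with plain nested lists; the function name, the index layout, and the toy numbers are all hypothetical and serve only to make the computation concrete.

```python
def marginalize_attacker(joint_T, joint_r, policy):
    """Fold a fixed attacker policy into defender-only models.

    joint_T[s][ai][aj][s2] = p(s2 | s, ai, aj)
    joint_r[s][ai][aj]     = r(s, ai, aj)
    policy[s][ai][aj]      = p(aj | s, ai)
    Returns (T, R) with T[s][ai][s2] = p(s2 | s, ai) and R[s][ai] = R_t(s, ai).
    """
    n_states = len(joint_T)
    T, R = [], []
    for s in range(n_states):
        T_s, R_s = [], []
        for ai in range(len(joint_T[s])):
            probs = policy[s][ai]
            # Eq. 1: sum over attacker actions of p(s2|s,ai,aj) * p(aj|s,ai).
            T_s.append([
                sum(joint_T[s][ai][aj][s2] * probs[aj] for aj in range(len(probs)))
                for s2 in range(n_states)
            ])
            # Eq. 2: expected reward under the attacker policy.
            R_s.append(sum(joint_r[s][ai][aj] * probs[aj] for aj in range(len(probs))))
        T.append(T_s)
        R.append(R_s)
    return T, R


# Toy model: 2 states, 1 defender action, 2 attacker actions (made-up numbers).
joint_T = [[[[1.0, 0.0], [0.0, 1.0]]], [[[0.5, 0.5], [0.0, 1.0]]]]
joint_r = [[[0.0, 4.0]], [[2.0, 6.0]]]
policy = [[[0.75, 0.25]], [[0.5, 0.5]]]
T, R = marginalize_attacker(joint_T, joint_r, policy)
print(T, R)
```

Each row of the resulting T is again a probability distribution over next states, which is what allows the game to be treated as a single-agent POMDP.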
The critical assumption of a POMDP is that the world states ($S$) are not fully
observable, and therefore the concept of a belief state is introduced (a probability
distribution over the world states $S$). This is how we transform the POMDP into a belief-MDP,
where the transition, observation, and reward functions are defined over the belief space.
Initially, the defender has an initial belief $b_0$. The belief is updated at every time step
based on the action and observation pair:
$b'(s') = \eta \, p(o \mid s', a_i) \sum_{s} p(s' \mid s, a_i)\, b(s)$ (4)
$\eta = 1 / p(o \mid b, a_i)$ (5)
$p(o \mid b, a_i) = \sum_{s'} p(o \mid s', a_i) \sum_{s} p(s' \mid s, a_i)\, b(s)$ (6)
where $\eta$ is the normalizing constant. The transition function from belief state $b$ to
$b'$ when the defender takes action $a_i \in A_d$ is:
$T(b, a_i, b') = \sum_{o} p(b' \mid b, a_i, o)\, p(o \mid b, a_i)$ (7)
where $p(b' \mid b, a_i, o) = 1$ if the belief update with arguments $b, a_i, o$ returns $b'$,
and 0 otherwise. The reward function for taking action $a_i \in A_d$ in belief state $b$ is:
$R(b, a_i) = \sum_{s} R_t(s, a_i)\, b(s)$ (8)
$V(b) = \max_{a_i \in A_d} \left[ R(b, a_i) + \gamma \sum_{b'} T(b, a_i, b')\, V(b') \right]$ (9)
The main goal of PSP is to find the sequence of actions that maximizes the expected
rewards for the defender for each belief. The value function for each belief is given by
eq. 9. We discuss how to solve the PSP POMDP in section 6.3.2. In the following sub-sections,
we describe the domain and problem statement in detail.
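The belief update and the belief-space reward described above can be illustrated with a few lines of code. This is a minimal sketch with hypothetical names, a hypothetical index layout, and made-up toy numbers; it is not the POMDP solver used in this work.

```python
def belief_update(b, ai, o, T, O):
    """Return (b_next, p_obs) for belief b after action ai and observation o.

    T[s][ai][s2] = p(s2 | s, ai)   (defender-only transition model)
    O[s2][ai][o] = p(o | s2, ai)   (observation model)
    """
    n = len(b)
    # Unnormalized belief: p(o | s2, ai) * sum_s p(s2 | s, ai) * b(s).
    unnorm = [
        O[s2][ai][o] * sum(T[s][ai][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    p_obs = sum(unnorm)  # probability of the observation given b and ai
    if p_obs == 0.0:
        raise ValueError("observation has zero probability under this belief")
    # Dividing by p_obs applies the normalizing constant.
    return [u / p_obs for u in unnorm], p_obs


def belief_reward(b, ai, R):
    """Expected immediate reward of action ai in belief b, with R[s][ai]."""
    return sum(R[s][ai] * b[s] for s in range(len(b)))


# Toy model: 2 states, 1 action, 2 observations (made-up numbers).
T = [[[0.5, 0.5]], [[0.0, 1.0]]]
O = [[[1.0, 0.0]], [[0.5, 0.5]]]
R = [[-1.0], [-3.0]]
b_next, p_obs = belief_update([0.5, 0.5], 0, 0, T, O)
print(b_next, p_obs, belief_reward(b_next, 0, R))
```

In this toy model the updated belief still sums to one, and the observation probability returned alongside it is the quantity that weights each successor belief in the value function.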
Figure 6.1: System State
6.2.1 Agents
There are two agents (I) in this game: an adversary Ia (a cyber hacker) and a
defender Id (the power utility). The adversary may be an insider (who wants to take
revenge), a state-sponsored hacker, or a terrorist hacker. The main motive of the adversary is
to reduce the resilience of the grid by attacking the cyber and physical infrastructure. The
motive of the defender is to stabilize the grid by taking actions in the cyber and physical
domains at minimum operating cost.
6.2.2 System State
The state (S) of the system is represented by two directed graphs. Graph G1
represents the power distribution network (see Figure 6.1, upper portion). Its nodes are power
reserve (PR), power distribution (PD), and power sink (client, C) nodes. PR nodes store a certain
amount of power to meet unexpected demand, and the PU pays some cost to maintain them. PR nodes are
connected to PD nodes, and PD nodes are in turn connected to the client (C) nodes, which
consume power. Here, each client node abstracts a particular zip code with its power demand. The
edges between these nodes represent the power flow from PR to PD to C. Each edge (Ep) has a power
capacity (Ce) and an amount of power actually flowing through it (Ue). Each node and edge in the
graph obeys the conservation of power flow. The amount of power going into
the PD node is equal to the amount of power going out of it. The power flow through every edge
satisfies Ue ≤ Ce. The client nodes have a certain power demand that may change over time; they
always consume power (edges go into them). The power reserve nodes are power sources that
always produce power (edges go out of them).
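The two constraints on G1 described above (edge capacity and flow conservation at PD nodes) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical three-edge network; the node names and all numbers are illustrative and do not come from the test network of Figure 6.1.

```python
# Each edge maps (source, target) -> (capacity Ce, flow Ue). Values are hypothetical.
edges = {
    ("PR1", "PD1"): (100.0, 60.0),
    ("PD1", "C1"): (50.0, 40.0),
    ("PD1", "C2"): (50.0, 20.0),
}


def capacity_violations(edges):
    """Return edges whose flow is negative or exceeds capacity (violates Ue <= Ce)."""
    return [e for e, (cap, flow) in edges.items() if flow < 0 or flow > cap]


def conservation_violations(edges, pd_nodes, tol=1e-9):
    """Return PD nodes whose inflow differs from their outflow."""
    bad = []
    for node in pd_nodes:
        inflow = sum(flow for (u, v), (_, flow) in edges.items() if v == node)
        outflow = sum(flow for (u, v), (_, flow) in edges.items() if u == node)
        if abs(inflow - outflow) > tol:
            bad.append(node)
    return bad


print(capacity_violations(edges))
print(conservation_violations(edges, ["PD1"]))

# An overloaded edge to C1 breaks both constraints.
overloaded = dict(edges)
overloaded[("PD1", "C1")] = (50.0, 55.0)
print(capacity_violations(overloaded))
print(conservation_violations(overloaded, ["PD1"]))
```

Only PD nodes are checked for conservation because PR nodes act as sources and client nodes as sinks.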
Graph G2 represents the information flow between the client nodes and the PU (see Figure
6.1, lower portion). How does the PU determine demand in a particular region? Through the AMI,
information about client power consumption and demand is sent to the PU. Based on these
readings, the PU decides what action to perform to meet the demand and maintain system
resilience. The graph has three types of nodes: the PU, which gathers information;
routers, which are intermediate nodes that transmit information; and client nodes, which send
power readings. Over this network, the PU sends commands to clients to perform load shedding,
curtailment, etc. The edges (EI) represent the information flow from PU to C and vice versa.
An edge in the information network can be either ACTIVE or INACTIVE, and a node in the
information network can be either MALICIOUS or NON-MALICIOUS, depending on the type of
attack performed by an adversary.
6.2.3 Actions
From a defender's point of view, there are two categories of actions: Cyber Actions and
Physical Actions. Cyber actions are performed from the cyber domain and are further divided
into two categories: Cyber-Cyber and Cyber-Physical Protection. Cyber-cyber actions
protect the cyber components from cyber-attacks. For instance, vulnerability assessment and
patching vulnerabilities prevent a system from getting compromised by an adversary. The
cyber-physical protection actions are taken from the cyber domain on the physical
infrastructure so that the grid continues to meet the power demand. For instance, load
curtailment or load shedding commands are sent from the cyber domain via the demand response
mechanism to reduce power consumption at the customer end.
The physical actions are those performed in the physical domain, for
instance, repairing a compromised node or link in the infrastructure or manually updating the
software of a Programmable Logic Controller (PLC). Physical actions are costlier than
cyber actions because the defender must send a technician to repair parts of the network, which
also takes time. In this thesis, we are concerned with the cyber-physical and physical actions
of the defender. We do not consider the cyber-cyber actions (such as patching or scanning).
The action set for the defender is:
AD : {Cyber-Physical Actions: No-Action, Increase Power Reserve, Reduce Power
Reserve, Load Curtailment, Load Shedding; Physical Actions: Repair Node, Repair Link}.
We include No-Action for the defender for the case where the system is in a consistent
state and no action is needed. The adversary acts by attacking the information
network. The adversary can perform a variety of attacks, such as malware injection in PLCs or
AMI meters, data integrity attacks, etc., with the aim of preventing correct power demand
information from reaching the defender. There are three categories of attacks that can be performed:
AA : {Topology attacks, Integrity attacks, Hijacking}.
Topology attacks remove a node or link in the information network,
which prevents power consumption information from reaching the utility. Integrity attacks alter the
readings from the nodes (meters). These are much harder to detect, and hence a defender may be
deceived. The integrity attack is performed by hijacking a node or link, where the attacker issues
fake commands for actions available to the defender, such as load curtailment. Each action
performed by the agents has some cost involved and provides a reward. We will discuss the
reward and cost structure in the following sections.
6.2.4 State Transition and Observations
The state transition is determined by the choice of defender actions (which exhibit
a deterministic response) and attacker actions (which also exhibit a deterministic response but
occur stochastically). The state of the system is represented by the two graphs G1 (power
graph) and G2 (information graph). Agents perform the actions described in the previous
sub-section. The defender acts first in a particular belief state, followed by the attacker. The
system then moves to a new belief state, and the defender receives an observation. The defender
could receive the following observations:
OD: {Node failure, Link failure, Node and Link failure, Area Frequency goes up, Area
Frequency goes down, Attack on AGC, Attack on DER, Attack on AMI}.
Consider Figure 6.2 for a state transition example. Suppose the system is in state s1: Node
Disable, where an attacker has already disabled the wireless node in the communication
network. The defender wants to repair this node to maintain state awareness
and send signals to consumers, so the defender performs action d1: Repair node. The attacker
performs action a2: Load Drop attack with some probability p(a2 | s1, d1). Due to this, the
frequency increases. Note that the nominal area frequency is 60 Hz. Over-frequency
protection is enabled with a threshold of 61.00 Hz and under-frequency protection with a
threshold of 59.00 Hz, each with a pickup time of 2 seconds. If the frequency in the system
exceeds 61.00 Hz for more than 2 seconds, generators will trip in response to the over-frequency
protection mechanism; the same applies in the under-frequency case. Over frequency occurs when
there is more generation than load, and under frequency when there is less generation than load.
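The protection rule described above can be sketched as a small check over sampled frequency readings. This is an illustrative sketch only: the fixed sampling interval `dt`, the function name, and the convention that each sample accounts for `dt` seconds are assumptions, while the 61.00 Hz and 59.00 Hz thresholds and the 2-second pickup time are taken from the text.

```python
def generators_trip(samples_hz, dt, over=61.0, under=59.0, pickup=2.0):
    """Return True if the frequency stays beyond a threshold for more than `pickup` seconds.

    samples_hz: frequency readings in Hz taken every `dt` seconds (assumed sampling model).
    A reading back inside the band resets the corresponding timer.
    """
    over_time = 0.0
    under_time = 0.0
    for f in samples_hz:
        over_time = over_time + dt if f > over else 0.0
        under_time = under_time + dt if f < under else 0.0
        if over_time > pickup or under_time > pickup:
            return True
    return False


# Load drop: frequency rises above 61 Hz and stays there for 2.5 s.
print(generators_trip([60.0, 61.5, 61.5, 61.5, 61.5, 61.5], dt=0.5))
# A 2.0 s excursion does not exceed the pickup time.
print(generators_trip([61.5, 61.5, 61.5, 61.5, 60.0], dt=0.5))
```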
Figure 6.2: State Transition
The system moves to a new state s2: Load Dropped. The defender will observe one
of two observations: o1: Frequency increase, or o2: Node Disable and Frequency increase.
Finally, the defender incurs a cost because of the load drop attack. Based on the observation, the
defender performs an action and the system moves to a new state. In section 6.3, we
describe this example in detail for the experiment.
6.2.5 Payoffs
We define the payoff of the defender at every time step t. The defender's problem is a
multi-objective optimization problem: the defender needs to stabilize the grid by taking
actions to fix attacks at minimum cost. The defender's payoff depends on the
following factors.
Distance from the ground truth Power Demand: The main function of the defender
is to meet power demand at each time step with minimum operating cost. Suppose a
minimum power storage D is always maintained by the defender. There always exists a
ground truth power demand at each time step t, pdt, which is the total amount of power
required by the customers c:
$$pd_t = \sum_{c \in C} pd_{c,t} \qquad (10)$$
where c ∈ C ranges over the customer nodes present in the power network G1. The defender
would like to meet the ground truth power demand at the lowest possible cost. The distance
from this ground truth at time t is calculated by taking its difference from the total amount
of power reserve available to the defender,
$$pr_t = \sum_{r \in R} pr_{r,t} \qquad (11)$$
where r ∈ R ranges over the power reserve nodes present in the power network. The difference
at time t is prt - pdt. A positive difference indicates excess power storage, and a negative
difference means there is a power shortage. The main purpose of the defender is to
minimize the absolute value of this difference, i.e.,
$$|pr_t - pd_t| \qquad (12)$$
There is a threshold on this difference up to which the power frequency is maintained in the grid
and it is stable. If the power shortage crosses this threshold, generators will trip due to the
under-frequency protection mechanism. If it is maintained, the defender incurs the cost of storing
more power. We include a term Cper unit, which denotes the cost of reserve per unit. It is positive
when there is excess power, otherwise zero. Therefore, the distance from the ground truth is
defined by | prt - pdt | + Cper unit.
Cost of Repairing Action (CR): If the defender repairs an INACTIVE or
MALICIOUS node, there is an inherent cost involved. A Boolean variable multiplying CR indicates
whether a repair action has been taken, and CR is the fixed cost of repairing.
Number of Inactive or Malicious links (NIM): The payoff for the defender also depends
on the percentage of nodes and links that are MALICIOUS or INACTIVE.
The defender's payoff is represented as Dt:
$$D_t = \mathrm{int}\big(\,|pr_t - pd_t| + C_{\text{per unit}} + C_R + NIM\,\big) \qquad (13)$$
where int denotes normalization of each value (for instance, the distance is divided by the
maximum distance) and the CR term is incurred only when a repair action is taken. The defender's
goal is to take actions that minimize Dt at every time step, given the fixed policy of the attacker.
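The structure of eq. 13 can be sketched as follows. This is one possible reading rather than the exact computation used in the experiments: the function name, the per-term normalization constants (`max_gap`, `max_reserve_cost`, `max_repair_cost`), and the choice to charge the reserve cost as unit cost times excess power are all assumptions made for illustration.

```python
def defender_cost(pr_t, pd_t, unit_cost, repaired, repair_cost, nim_fraction,
                  max_gap, max_reserve_cost, max_repair_cost):
    """Normalized per-step cost D_t in the spirit of eq. 13 (lower is better).

    pr_t, pd_t      total power reserve and total ground-truth demand at time t
    unit_cost       reserve cost per unit, charged only on excess power (assumption)
    repaired        Boolean: was a repair action taken at this step?
    nim_fraction    fraction of nodes/links that are MALICIOUS or INACTIVE, in [0, 1]
    max_*           hypothetical normalization constants, one per term
    """
    gap = pr_t - pd_t
    distance = abs(gap) / max_gap
    reserve = (unit_cost * gap if gap > 0 else 0.0) / max_reserve_cost
    repair = (repair_cost if repaired else 0.0) / max_repair_cost
    return distance + reserve + repair + nim_fraction


# Excess reserve of 20 units, one repair, a quarter of the information network degraded.
excess = defender_cost(pr_t=120.0, pd_t=100.0, unit_cost=2.0, repaired=True,
                       repair_cost=50.0, nim_fraction=0.25,
                       max_gap=40.0, max_reserve_cost=80.0, max_repair_cost=50.0)
# Shortage of 10 units, no repair, healthy information network.
shortage = defender_cost(pr_t=90.0, pd_t=100.0, unit_cost=2.0, repaired=False,
                         repair_cost=50.0, nim_fraction=0.0,
                         max_gap=40.0, max_reserve_cost=80.0, max_repair_cost=50.0)
print(excess, shortage)
```

Normalizing each term separately keeps the four factors on a comparable scale before they are summed, which matches the example given for the distance term.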
Note that when we model the problem as a POMDP, we must specify payoffs in advance. The
question that arises is how to know the payoffs for different states before they occur, because
payoffs depend on factors that are determined at each time step. We compute payoffs in the
following way. First, states differ in the number of disabled nodes. For example,
s is a state where one node is disabled and s′ is one where two nodes are disabled, so
we know the cost once we know the number of nodes disabled in a state. Second, the cost of
repairing a disabled node is a standard value specified by the defender. Finally, the cost of
storing power reserve depends on the demand and reserve at each time step. The gap cannot exceed
the threshold; otherwise, the power system will destabilize because the power frequency crosses
the under- or over-frequency protection threshold. So we do not compute payoffs for gaps greater
than or equal to the threshold. For gaps below the threshold, the defender always maintains D
amount of power reserve. For the difference from D, we assign the average cost the defender has
incurred in the past. This simplifies model generation.
6.2.6 Assumptions
We make the following assumptions while formulating the PSP POMDP model.
1. We do not have real-world data, so we randomly assign probabilities for state
transitions, observations, and the attacker’s policy, as well as payoff values, in our
simulation. The probability of transition p(s′ | s, ai) is computed based on eq. 1 after
assigning the probabilities p(s′ | s, ai, aj) and p(aj | s, ai).
2. The actions performed by the defender and attacker have deterministic responses.
3. The time it takes for an action to be performed and take effect is not
considered.
4. It is difficult to scale the problem without real-world data because we
have to assign probabilities for each transition of the system, the agents' actions, and the
rewards. Future work is to find ways to scale the problem with and without
data and to use the above payoff method to compute payoffs.
5. The power utility may have contracts with many power reserve companies that
charge differently for power at different times of day. We have not considered this
scenario.
6.3 Experiment
In this section, we provide details of the simulation and discuss results.
6.3.1 POMDP Model
We generated a POMDP model manually to demonstrate the concept in this
dissertation. Generating a POMDP model for the power demand satisfying game requires
knowledge of the possible states, actions, observations, and rewards the agents receive. In real
life, the actions and observations can be derived from the experience of the defender (system
admin) and from tools such as an IDS. The states of the system can be derived from the history of
the system by determining the different states the system has been in. The admin’s
experience should be used to assign the cost incurred in taking an action to protect the system.
The initial belief of the system starts from the normal state because the defender will
not act if there is no attack. Consider Figure 6.1 as the test network for the simulation. We have
a SCADA node that monitors the state of the system and takes actions; a customer node (abstracting
a zip code) that sends information about power consumption and receives commands from the PU;
and PD and PR nodes that store power in the form of batteries. We define the POMDP model in
the form of a POMDP file, as described in [105]. The actions of the defender and attacker are
defined in Parameter 6.1 and 6.2, respectively. The probability of the attacker taking a particular
action depends on the state and the action of the defender. For simulations, we assume that the
attacker's policy assigns each action a probability of one over the number of actions available to
the attacker. A naive attacker does not know the best action to take in a state; for her, all
actions are equal. Since there are two actions available to the attacker, the probability of each is 0.5.
Parameter 6.1: List of Defender Actions in the test network.
Defender:
actions: d0 d1 d2
d0: nothing
d1: node-repair
d2: reduce-power-reserve-dispatch
Parameter 6.2: List of Attacker Actions in the test network.
Attacker:
actions: a0 a1
a0: node-disable
a1: load drop attack
Parameter 6.3: List of states in the test network.
states: s0 s1 s2 st
s0: normal
s1: node-removed
s2: load-drop
st: node-removed-load-drop
Parameter 6.4: List of observations in the test network.
o1: frequency-increases
o2: no-state-info
o3: normal-scenario
o4: no-state-info-frequency-increases
Parameter 6.5: Observation probabilities in the test network.
O: * : s0 : o3 1.000000
O: * : s1 : o2 0.900000
O: * : s1 : o4 0.100000
O: * : s2 : o1 0.900000
O: * : s2 : o4 0.100000
O: * : st : o1 0.100000
O: * : st : o2 0.100000
O: * : st : o4 0.800000
Parameter 6.6: Rewards corresponding to states in the test network.
R: * : * : s0 : * -10
R: * : * : s1 : * 20
R: * : * : s2 : * 10
R: * : * : st : * 30
Parameter 6.7: Transitions for actions d0, d1, and d2.
T: d0 : s0 : s0 1.000000
T: d0 : s1 : s1 1.000000
T: d0 : s2 : s2 1.000000
T: d0 : st : st 1.000000
T: d1 : s0 : s1 0.400000
T: d1 : s0 : s2 0.400000
T: d1 : s0 : st 0.200000
T: d1 : s1 : s0 0.100000
T: d1 : s1 : s1 0.500000
T: d1 : s1 : s2 0.400000
T: d1 : s2 : s2 0.600000
T: d1 : s2 : st 0.400000
T: d1 : st : s2 0.650000
T: d1 : st : st 0.350000
T: d2 : s0 : s1 0.500000
T: d2 : s0 : s2 0.500000
T: d2 : s1 : s1 0.500000
T: d2 : s1 : st 0.500000
T: d2 : s2 : s0 0.450000
T: d2 : s2 : s1 0.400000
T: d2 : s2 : s2 0.150000
T: d2 : st : s1 0.400000
T: d2 : st : s2 0.400000
T: d2 : st : st 0.200000
Parameter 6.8: POMDP Solver Policy Graph. N stands for Node ID, A for Action, and o for
observations.
N   A   o1  o2  o3  o4
0   1   14  0   16  15
1   1   14  2   16  15
2   1   14  4   16  15
3   1   12  8   16  10
4   1   12  3   16  13
5   1   14  3   16  13
6   1   12  3   16  10
7   1   12  7   16  6
8   1   12  8   16  4
9   1   12  8   16  5
10  2   15  3   16  4
11  2   14  3   16  5
12  2   12  7   16  3
13  2   15  8   16  4
14  2   12  7   16  6
15  2   12  8   16  4
16  0   12  0   16  12
Parameter 6.3 lists the possible states of the system if attacks happen on the
system shown in Figure 6.1. For instance, s1 is the node-removed state reached when the attacker
performs the physical action node disable. Parameter 6.4 lists the observations the defender
receives when the system moves to a new state. Parameter 6.5 lists the observations received in
each state with their probabilities; the observations depend on the state of the system. Parameter
6.6 lists the cost the defender incurs in a particular state; the rewards depend on the state of the
system. Parameter 6.7 gives the transitions from one state to another based on the action
taken by the defender (one group of entries for each action). The probability of transition is
calculated by considering the possible actions taken by the attacker in a particular state according
to eq. 1. The value of p(aj | s, ai) is 0.5 for all defender actions and states.
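A quick consistency check on the listed parameters is that every transition row (fixed action and start state) and every observation row (fixed state) sums to one. The sketch below transcribes the values from the parameter listings above and also illustrates the marginalization over attacker actions, assuming eq. 1 has the usual form p(s′ | s, ai) = Σj p(s′ | s, ai, aj) p(aj | s, ai). The decomposition used in the example (node-disable leading to s1 and load drop leading to s2 from s0 under d2) is an assumption chosen for illustration, not a statement of how the table was generated.

```python
from collections import defaultdict

# (defender action, start state, end state) -> probability, from the transition listing
T = {
    ("d0", "s0", "s0"): 1.0, ("d0", "s1", "s1"): 1.0,
    ("d0", "s2", "s2"): 1.0, ("d0", "st", "st"): 1.0,
    ("d1", "s0", "s1"): 0.4, ("d1", "s0", "s2"): 0.4, ("d1", "s0", "st"): 0.2,
    ("d1", "s1", "s0"): 0.1, ("d1", "s1", "s1"): 0.5, ("d1", "s1", "s2"): 0.4,
    ("d1", "s2", "s2"): 0.6, ("d1", "s2", "st"): 0.4,
    ("d1", "st", "s2"): 0.65, ("d1", "st", "st"): 0.35,
    ("d2", "s0", "s1"): 0.5, ("d2", "s0", "s2"): 0.5,
    ("d2", "s1", "s1"): 0.5, ("d2", "s1", "st"): 0.5,
    ("d2", "s2", "s0"): 0.45, ("d2", "s2", "s1"): 0.4, ("d2", "s2", "s2"): 0.15,
    ("d2", "st", "s1"): 0.4, ("d2", "st", "s2"): 0.4, ("d2", "st", "st"): 0.2,
}
# (state, observation) -> probability, from the observation listing (same for all actions)
O = {
    ("s0", "o3"): 1.0, ("s1", "o2"): 0.9, ("s1", "o4"): 0.1,
    ("s2", "o1"): 0.9, ("s2", "o4"): 0.1,
    ("st", "o1"): 0.1, ("st", "o2"): 0.1, ("st", "o4"): 0.8,
}


def row_sums(table):
    """Sum probabilities over the last key component (end state or observation)."""
    sums = defaultdict(float)
    for key, p in table.items():
        sums[key[:-1]] += p
    return dict(sums)


def marginalize(conditional, attacker_policy):
    """p(s' | s, ai) = sum_j p(s' | s, ai, aj) * p(aj | s, ai)."""
    out = defaultdict(float)
    for aj, dist in conditional.items():
        for s_next, p in dist.items():
            out[s_next] += attacker_policy[aj] * p
    return dict(out)


t_ok = all(abs(v - 1.0) < 1e-9 for v in row_sums(T).values())
o_ok = all(abs(v - 1.0) < 1e-9 for v in row_sums(O).values())

# Illustrative decomposition for (s0, d2) with a uniform naive attacker.
example = marginalize(
    {"a0": {"s1": 1.0}, "a1": {"s2": 1.0}},   # deterministic responses (assumed)
    {"a0": 0.5, "a1": 0.5},                   # p(aj | s, ai) = 0.5
)
print(t_ok, o_ok, example)
```

Under these assumptions the marginal for (s0, d2) reproduces the two 0.5 entries listed for that row.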
6.3.2 Solving POMDP Model
We use the POMDP solver [105] to compute the optimal policy for the defender against a fixed
attacker. The solver uses a basic dynamic programming approach, solving
one stage at a time. It stops when the solution is within a tolerance of the infinite-horizon
solution. The POMDP solver takes as input a POMDP file in the format shown in Parameters 6.1 to
6.7. It computes the optimal value function vector coefficients and the optimal policy graph (in
Parameter 6.8) based on the observations received by the agent, using the Incremental pruning
algorithm [106]. The simulation runs on a machine with an Intel Core i5 at 2.4 GHz and 8 GB of
RAM. The solver is run without a time horizon limit and with a discount factor of 0.95. The
total time to solve the POMDP model is 13.31 seconds.
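To give a feel for what a discount factor of 0.95 implies for a stage-by-stage solution, the following worked bound uses the standard geometric-series argument. It is an illustration only and is not taken from the solver's documentation: the tolerance ε = 0.01 is an assumed value, and R_max = 30 is the largest reward magnitude in the reward listing above.

```latex
\[
|V(b)| \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max}
       \;=\; \frac{R_{\max}}{1-\gamma}
       \;=\; \frac{30}{0.05} \;=\; 600,
\]
\[
\frac{\gamma^{n} R_{\max}}{1-\gamma} \le \varepsilon
\;\Longleftrightarrow\;
n \;\ge\; \frac{\ln\!\big(\varepsilon\,(1-\gamma)/R_{\max}\big)}{\ln \gamma}
  \;=\; \frac{\ln(0.01/600)}{\ln 0.95} \;\approx\; 214.5 .
\]
```

Under these assumptions, about 215 dynamic programming stages are enough for the remaining discounted reward to fall below the chosen tolerance.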
6.3.3 Simulation Analysis
The simulation results are in the form of value function vectors and a policy graph. Each
line of the policy graph (in Parameter 6.8) represents one node with a unique node ID (N). Nodes
are numbered sequentially, in line with the value function vectors file. The node
ID is followed by an action number (A), which is followed by a list of node IDs, one for
each observation (o). This list specifies the transitions in the policy graph: the o'th number in
the list is the index of the node that follows this one when the observation received is 'o'.
As an illustration of the optimal policies found by the POMDP solver, consider a simple
case where the defender is in some belief state b (say node ID 16 in Parameter 6.8). If, in belief
state b, the defender observes o3 (everything is working fine), the POMDP solution recommends
jumping to node ID 16 and performing action d0 (No-Action), so the defender remains in the same
belief state. If the defender observes o1 (frequency increases) in b, the solution recommends
jumping to node ID 12 and performing action d2 (reduce power reserve dispatch) to keep the
frequency of the system within threshold limits. Moreover, if the defender observes o4 (no state
info and frequency increase) in b, the solver recommends d2. Note that, across all node IDs, when
the defender observes o4, the solution recommends d1 for roughly half of the nodes and d2 for the
other half. This is because o4 means a node is disabled and there is a load drop attack. In all
node IDs, when the defender observes o3 (normal scenario), the policy jumps to node 16 and performs
d0 (No-Action). This shows that the POMDP model can make effective decisions that maintain system
resilience and minimize the cost to the defender.
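The way the policy graph is read can be made concrete with a short sketch that transcribes the table and follows its edges. The function names and the dictionary layout are illustrative choices; the node IDs, action numbers, and successor lists are copied from the policy graph above.

```python
from collections import Counter

# node ID -> (action number, [successor node for o1, o2, o3, o4])
POLICY = {
    0: (1, [14, 0, 16, 15]),
    1: (1, [14, 2, 16, 15]),
    2: (1, [14, 4, 16, 15]),
    3: (1, [12, 8, 16, 10]),
    4: (1, [12, 3, 16, 13]),
    5: (1, [14, 3, 16, 13]),
    6: (1, [12, 3, 16, 10]),
    7: (1, [12, 7, 16, 6]),
    8: (1, [12, 8, 16, 4]),
    9: (1, [12, 8, 16, 5]),
    10: (2, [15, 3, 16, 4]),
    11: (2, [14, 3, 16, 5]),
    12: (2, [12, 7, 16, 3]),
    13: (2, [15, 8, 16, 4]),
    14: (2, [12, 7, 16, 6]),
    15: (2, [12, 8, 16, 4]),
    16: (0, [12, 0, 16, 12]),
}
ACTIONS = ["d0", "d1", "d2"]          # nothing, node-repair, reduce-power-reserve-dispatch
OBSERVATIONS = ["o1", "o2", "o3", "o4"]


def step(node, obs):
    """Follow one edge: return the successor node and the action it prescribes."""
    successor = POLICY[node][1][OBSERVATIONS.index(obs)]
    return successor, ACTIONS[POLICY[successor][0]]


def run(node, observations):
    """Trace the policy graph for a sequence of observations."""
    trace = []
    for obs in observations:
        node, action = step(node, obs)
        trace.append((obs, node, action))
    return trace


# Actions recommended after observing o4, counted over all 17 nodes.
after_o4 = Counter(step(n, "o4")[1] for n in POLICY)

print(run(16, ["o3", "o1", "o4"]))
print(after_o4)
```

Starting from node 16, the trace reproduces the cases discussed above: o3 keeps the defender at node 16 with d0, and o1 leads to node 12 with d2. Counting over all nodes, o4 leads to d1 from nine nodes and to d2 from eight.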
As noted in the assumptions, we have not assigned a threshold value. In reality, if the
gap between the power demand and the power reserve exceeds the threshold, the solver will recommend
the same action, that is, to reduce the power reserve dispatch. Consider another scenario where
power demand increases and frequency goes down: the defender can either increase the power
reserve dispatch, perform load shedding, or perform load curtailment. If this scenario is given to
the POMDP solver, the solver will recommend an action that stabilizes the grid and minimizes
the cost. For this, we have to specify the load shedding and load curtailment actions and the cost
of performing them in the model.
6.4 Conclusion
In this chapter, we formulated the Power Storage Protection game against a fixed naive
adversary. We fix the strategy of the adversary, model the problem as a POMDP from the
perspective of the defender (power utility), and solve it using a POMDP solver. We provide the
theoretical framework for formulating the PSP problem and experimental results to
support our claim using a simplified PSP game. Our experimental results show that the
defender can compute the optimal policy to recover from CPAs. This model can be used to
compute the optimal policy against different classes of attackers. The main challenge is to
determine the transition probabilities, observation probabilities, and rewards for each state
without a data set from which to generate the POMDP model. Power engineers must compute these
parameters based on their experience with the system and its data, carefully considering the
various factors discussed in this chapter.
Chapter 7
Defending Oil Pipeline From Cyber-Physical Attacks
The security of critical infrastructures such as oil and gas cyber-physical systems is a
significant concern in today’s world, where malicious activities are more frequent than ever before.
On one side, we have cybercriminals who compromise cyberinfrastructure to control physical
processes; on the other, we have physical criminals who attack the physical infrastructure to
destroy the target or to steal oil from pipelines. Unfortunately, due to limited resources and
physical dispersion, it is impossible for the system administrator to protect every target all the
time. In this chapter, we tackle the problem of cyber and physical attacks on oil pipeline
infrastructure by proposing a Stackelberg Security Game with three players: the system
administrator as the leader, and the cyber and physical attackers as followers. The novelty of this
work is that we formulate the real-world problem of oil stealing using a game-theoretic approach.
The game has two different types of targets attacked by two distinct types of adversaries with
different motives who can coordinate to maximize their rewards. The solution to this game helps
the system administrator of the oil pipeline cyber-physical system allocate cyber security
controls to the cyber targets and assign patrol teams to the pipeline regions efficiently. This
work provides a theoretical framework for formulating and solving the above problem.
7.1 Domain Description
Pipelines are vital infrastructure for energy flow within and across nations. The U.S.
has a network of more than 185,000 miles of liquid pipelines, 320,000 miles of gas pipelines,
and over 2 million miles of gas distribution pipelines [110]. If a pipeline system is disturbed,
either maliciously or non-maliciously, it can cause significant loss to the economy. Hence, we
must understand the resilience of such infrastructures and develop strategies to protect them
from CPAs. In this section, we describe the Oil Pipeline CPS (OPCPS) in detail: first
the motivation of the players, second the cyber and physical targets, and finally
the challenges faced by the players.
7.1.1 Motivation of Players
The motive of the Cyber Attacker (CA) is to reduce the operational resilience of the
system. To achieve this, she performs a variety of cyber attacks [5] [7] such as malware
injection, data integrity, DDoS, unauthorized access, system hijack, etc. with a motive to
compromise different cyber targets (CT) to control physical processes. The goal of the Physical
Attacker (PA) is to steal oil from the pipeline segments to sell that oil on the black market to
raise funds [9].
CA and PA can coordinate with one another to maximize their rewards. The motive of
the CA to coordinate is to receive a portion of the oil or money (after selling the oil) from the
PA. At the same time, she can reduce the operational resilience of the system (her primary motive)
since she has already compromised the CTs before she helps the PA. On the other hand, the
motivation of the PA to coordinate is to steal oil from the pipeline and to reduce the probability
of getting caught or detected. For instance, if the CA can reduce the situational awareness of a
high-pressure transmission pipeline by deceiving the defender by showing misleading pressure
readings displayed on the HMI, the PA will have more time to steal oil from that location, and
her chances of getting caught will be reduced. It becomes tough for the defender to detect such
coordinated attacks since the information she is receiving is misleading. The defender is
interested in the overall security of the system. Her motive is to deploy the cyber resources to
protect the cyber and network infrastructure and assign patrol teams to a subset of the
pipelines so as to minimize oil stealing activities.
7.1.2 Understanding of Cyber Targets
Before protecting or compromising any CT, it is important to compute its severity level.
The severity level indicates the degree of impact on the
system if a CT is compromised. For that, we need to identify the vulnerabilities associated with
the CTs. Although there are a number of ways to determine the vulnerabilities, such a
description is out of the scope of this work. Once we know the vulnerabilities, we
assign each a CVSS vulnerability score [80]. Given the score of each
vulnerability, we compute the overall score of the CT by taking the average. This score
represents the severity level associated with that target. It also helps the system admin define
the cost of protecting and compromising a CT and indicates which security control should be
deployed on that target.
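The averaging step can be sketched in a few lines. The function name and the example scores are hypothetical; the only facts relied on are that the severity level is the mean of the per-vulnerability scores, as stated above, and that CVSS base scores range from 0.0 to 10.0.

```python
def severity_level(cvss_scores):
    """Average the CVSS scores of a cyber target's vulnerabilities.

    Raises ValueError for an empty list or a score outside the CVSS range 0.0 to 10.0.
    """
    if not cvss_scores:
        raise ValueError("a cyber target needs at least one scored vulnerability")
    for score in cvss_scores:
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"CVSS score out of range: {score}")
    return sum(cvss_scores) / len(cvss_scores)


# Hypothetical cyber target with three scored vulnerabilities.
hmi_severity = severity_level([9.8, 7.5, 5.3])
print(round(hmi_severity, 2))
```

A plain mean treats every vulnerability equally; a deployment that cares most about the worst case might prefer the maximum score instead, which is a different design choice from the one described in the text.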
Once we have understood the severity level, we need to know how important the CT is
from the physical process viewpoint. Here, we need to know the percentage (or weight
wi) of impact on the volume of oil delivered per day if a CT is compromised. The wi depends
on factors such as information loss, control loss, and whether the effect is direct or indirect.
Information loss concerns whether oil delivery is affected if a CT is not receiving any
information (such as the pressure and speed of the oil in the pipeline) about a physical process.
Similarly, control loss concerns whether oil delivery is affected if a CT is not able to control
some physical process, such as oil pressure in a pipeline. The wi of a CT may be directly
measured, learned from attack data, or determined by the system engineers or by
understanding the network topology and system functioning. The CA can perform a variety of
attacks to introduce information and control loss. For instance, a DDoS attack on the
communication infrastructure or an attack on the metering system causes the monitoring and
information loss of the SCADA system. We also need to consider the cost of applying the
security control to a CT because the defender has limited resources. It depends on:
1. cost of a software patch,
2. how long it takes to patch a software, and
3. maintenance cost.
7.1.3 Understanding of Physical Targets
The importance of the physical targets depends on:
1. the average amount of oil flowing through a pipeline segment daily,
2. population density in the area of a pipeline segment,
3. pipeline condition (old or new),
4. sensors placed or not on the pipeline at certain intervals,
5. how far it is from the initial position of the PA and control station,
6. defender patrolling frequency in a particular area of pipeline segment,
7. whether it is buried under the ground.
We include distance and time as discounting factors because the PA has to travel to
the pipeline location from some initial position and then carry the oil to the final destination,
which takes time and has a cost. The system engineers and PAs evaluate these factors before deciding
which physical target to protect and attack, respectively. For instance, if a pipeline is located
in an area of high population density, there is a chance that someone will see the PA stealing
oil, so she avoids the area. The penalty that the CA and PA receive depends on the laws of the
country. It also depends on the types of attack they have performed and the consequences of those
attacks. In most cases, the PA will get a higher penalty than the CA since she is physically
present at the location of the attack.
7.1.4 Challenges faced by players
We now describe the challenges faced by the defender and attackers in the OPCPS. To
assign security controls to the CTs, the defender should perform a vulnerability assessment (VA)
of the set of cyber targets. Similarly, the CA performs a VA to learn the vulnerabilities before
planning to compromise the targets. Once vulnerabilities are identified, the defender identifies
security controls specific to each vulnerability and plans how and when to deploy them.
Unfortunately, it is not trivial to implement a security control (for instance, a software patch)
in an ICS, since patching or installing new software may require shutting down a subset
of the ICS process or all of it. The amount of damage prevented by applying a security control to a
CT is termed mitigation [111]. Deploying a security control on a CT incurs a cost
for the defender, which depends on the number and types of vulnerabilities in the cyber target
and the time to deploy controls. Similarly, the CA also incurs a cost to compromise targets (in
terms of time spent finding vulnerabilities, writing malware specific to the system, purchasing
security tools, etc.). According to ICS-CERT [112], “an adversary who wants to control a
critical ICS, she has to face three significant challenges”:
1. gain access to the control system local area network,
2. understand industrial control process and
3. gain control to that process.
CAs perform long-term reconnaissance operations to learn and understand the CPS. It
takes time to learn the ICS and to perform synchronized attacks, which are more intrusive and
difficult to detect. When the CA understands the cyber system, she also understands the
physical correspondence of the CTs, such as in [5]. The CA observes the strategies of the defender
and takes appropriate actions. Once she has learned this, she can use her knowledge and
ability to control the system functionality to help the PA steal oil by disabling the security
checks. It is important for the defender to understand the ability of different CAs to learn,
compromise, and control CTs. The PA is also interested in learning the ability of the CA
because if the CA is not able to monitor physical targets appropriately, the PA will not be
interested in coordination.
We now describe the different CAs and their abilities. The ability of a CA (a value greater than
0.0 and at most 1.0) to compromise the CTs and control physical targets depends on her knowledge and
expertise [92]. The first-level attackers are “script kiddies”, who attack randomly; their chance of
detection is high. They borrow payloads or copy from other sources. The second-level attacker is a
“motivated attacker”, someone who understands the system and then attacks using pre-developed
attacks. Finally, we have experienced attackers who perform “organized crimes”.
They write exploits or malware tailored to the system’s functionality, such as the Stuxnet
malware. Therefore, the ability value depends on the level of the CA. Script kiddies have
lower values than motivated attackers, who in turn have lower values than organized crime
attackers. Under the agreement between the CA and PA, the amount of oil (or money after selling
oil) that the PA gives to the CA depends on the CA's ability. If it is high, she is likely to get
more oil (or money) from the PA for helping her. We can use this parameter to understand how the
defender should deploy resources against different types of CAs and how the level of coordination
between the CA and PA changes.
In the OPCPS, the compressor stations and distributed pipeline segments are vulnerable
to physical intrusions [9] because most are located in isolated areas and they are unmanned.
The challenge of protecting physical infrastructure is the broad area of physical distribution.
Also, systems such as compressor stations are placed every 40-70 miles [29] along the
pipeline, and it is not cost-effective to deploy personnel at each site. These
targets can be covered in several ways, using CCTV cameras, patrolling teams, helicopters, etc. In
this chapter, we are concerned with patrol teams patrolling different pipeline locations. We do not
consider CCTV cameras, since they can be controlled from the cyber domain, and
helicopters are the least cost-effective option. The PA chooses a pipeline segment on the basis of
its importance. The PA also observes the strategies of the defender, performing
reconnaissance to learn which pipeline segments are frequently patrolled. From her knowledge
of the interaction between the cyber and physical domains, the CA knows when the oil is
flowing and what the operating pressure, temperature, etc. are. Thus, she can help the PA steal
some extra amount of oil. To summarize, the key challenges for the defender in allocating
resources are:
1. The CA observes the mixed strategy of the defender.
2. The CA can find the vulnerabilities of the cyber system by performing VA and can
perform a variety of attacks on multiple cyber locations.
3. The CA understands the cyber and physical network and can find correlations between
cyber and physical nodes/systems (similar to how attackers found vulnerabilities
in Ukrainian power substations [5]).
4. The PA can observe the patrol teams at different places and times, thereby learning the
strategies of the defender (we have assumed that the PA does not know anything about
the cyber domain).
5. The CA can help the PA steal oil from a particular pipeline segment.
7.2 Game-Theoretic Approach
We have modeled the actions of the system admin of the OPCPS, who deploys
appropriate security controls on different cyber targets and patrolling teams in various pipeline
geographic regions, as a Stackelberg Security Game (SSG) among three players. The SSG domain
allows us to formulate our problem of resource allocation comprehensively and helps us to
generate optimal allocation policies. In this section, we first define what an SSG is, then
the rewards and penalties corresponding to the targets, and finally the optimal strategies of each
player.
7.2.1 Stackelberg Security Games
An SSG is a game where a leader commits to a strategy, and then followers choose their
strategy after observing the leader’s commitment. The defender has the advantage over the
adversaries because he is the leader in the SSG. He will choose those strategies where
adversaries break ties in his favor. Consider the simple normal form game in Table 7.1, where
rows correspond to the defender's pure strategies and columns to an attacker's pure strategies.
In the example, if the defender chooses D2, an adversary will choose A1 to maximize his utility
(8 rather than 3). However, if the defender chooses D1, an adversary will choose A2 (6 rather
than 5). The defender knows that by selecting D1 he always gets a higher utility, no matter
what the adversary chooses. Therefore, the defender will commit to D1.
Table 7.1: Normal Form Game

                       Attacker (Follower)
Defender (Leader)      Target A1     Target A2
D1                     10, 5         9, 6
D2                     5, 8          8, 3
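The leader-follower reasoning for Table 7.1 can be checked with a few lines of Python. This is a minimal sketch, not part of the original formulation: it assumes each cell lists the defender's utility first and the attacker's second (which matches the discussion above), considers pure commitments only, and breaks attacker ties in the defender's favor. The names `PAYOFFS`, `attacker_best_response`, and `defender_commitment` are illustrative.

```python
# Payoffs from Table 7.1, read as (defender utility, attacker utility).
PAYOFFS = {
    "D1": {"A1": (10, 5), "A2": (9, 6)},
    "D2": {"A1": (5, 8), "A2": (8, 3)},
}


def attacker_best_response(defender_action):
    """Follower observes the commitment and maximizes his own utility.

    Ties are broken in favor of the defender (second element of the key)."""
    row = PAYOFFS[defender_action]
    return max(row, key=lambda a: (row[a][1], row[a][0]))


def defender_commitment():
    """Leader picks the pure strategy whose induced response pays him most."""
    return max(PAYOFFS, key=lambda d: PAYOFFS[d][attacker_best_response(d)][0])


leader = defender_commitment()
follower = attacker_best_response(leader)
print(leader, follower, PAYOFFS[leader][follower])
```

Under these assumptions the defender commits to D1 and the attacker answers with A2, in line with the text.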
Nomenclature:
• $T_C$ is the set of cyber targets (indexed by $i$).
• $T_P$ is the set of physical targets (indexed by $j$).
• $R_i^{ca}$: reward the CA gets after successfully compromising target $i$.
• $P_i^{ca}$: penalty the CA gets if he is caught while compromising target $i$.
• $R_j^{pa}$: reward the PA gets after successfully compromising target $j$.
• $P_j^{pa}$: penalty the PA gets if he is caught while compromising target $j$.
• $w_i$: percentage of oil affected if target $i$ is compromised.
• $X_i$: cost of compromising target $i$.
• $s(i)$: set of targets which can be compromised easily once the CA has succeeded in
compromising $i$.
• i, z [0,1] represents the correlation between cyber targets i & z.
• $Z^{ca}$ and $Z^{pa}$: maximum punishment the attackers can receive.
• $V_j$: average amount of oil flowing through pipeline $j$ daily.
• $D_j$: distance travelled by the PA to reach pipeline $j$ from his initial position
and to go back carrying oil.
• $T_j$: time it takes to tap into pipeline $j$ and to take oil from the pipeline.
• domain parameter depends on pipeline elevation, geographic regions, population
density etc.
• $MT_{i,z}$: amount of damage prevented by applying security control $z$ to cyber target $i$.
• $C_z$: cost incurred for applying security control $z$.
• $X$, $Y$: defender’s coverage for CT and PT respectively.
• $C$, $P$: mixed strategies of the CA and PA to attack CT and PT respectively.
• $R_{x,i}^{dc}$: reward of the defender when he chooses cyber target $x$ and the CA chooses $i$.
• $R_{y,j}^{dp}$: reward of the defender when he chooses physical target $y$ and the PA chooses $j$.
• $C_{x,i}^{ca}$: reward of the CA when he chooses target $i$ and the defender chooses target $x$.
• $C_{y,j}^{pa}$: reward of the PA when he chooses physical target $j$ and the defender chooses target $y$.
• $m_x$: probability of covering cyber target $x$, $m_x \in X$.
• $n_y$: probability of covering physical target $y$.
• $c_i$: probability of attacking cyber target $i$, $c_i \in C$.
• $p_j$: probability of attacking physical target $j$, $p_j \in P$.
• $a_{nc}$, $b_{nc}$: maximum utility of the CA and PA, respectively, when not coordinating.
• $a_c$, $b_c$: maximum utility of the CA and PA, respectively, when coordinating.
• , are binary variables for CA and PA to coordinate or not respectively
• $c_{ij}$: correlation between cyber target $i$ and physical target $j$.
• int(·): normalization of a value. For instance, a distance is divided by
the maximum distance.
7.2.2 Rewards
Cyber Attacker. Some of the cyber targets control, directly or indirectly, some amount of
the oil flowing through a pipeline each day. The CA wants to attack those targets which have the
highest impact on the system, a low cost of attack, and a small likelihood of him getting caught
or detected. Suppose there is a CT i ∈ TC which the attacker wants to attack. The reward and
penalty corresponding to that CT i are:
Rica = int (wi Vt + z s(i) (Rzca i,z) – Xi) (1)
\[ P_i^{ca} = -Z^{ca} \tag{2} \]
The value of the CT depends on the percentage wi of oil affected once the CA compromises
target i. The amount of oil flowing through pipeline t is Vt. Since the attacker knows the
correlation between the CT and PT, she knows which CT i controls pipeline t. In the cyber
security domain, an attacker is interested in targets which will give her maximum privileges
directly or indirectly. Once an attacker has compromised CT i, it is possible that the CA can
compromise other targets connected to it which give her more privileges. That is why we have
included the second term in (1), in which the reward of each such target is weighted by the
correlation between the two cyber targets. This value is determined by the system engineers and
by understanding the network topology. Xi represents the cost of performing the attack. The cost
is not included in the penalty because the penalty is much higher than the cost incurred. The
cost is calculated by understanding the severity level of the target (discussed in sub-section
7.1.2). If the target has many vulnerabilities, the cost may be small and the target is easy to
attack; otherwise, it is not. Zca is the maximum punishment, which depends on the law of the
country and is usually high. Since cyber attackers perform the attack from different geographic
locations or countries, their chances of getting caught are low. The system admin can block
them over the internet to reduce the attack surface.
Physical Attacker. The PA performs physical attacks at some pipeline locations with the
motive of stealing oil. The PA selects a particular pipeline segment to attack on the basis of
the factors described in sub-section 7.1.3. Also, the PA observes the patrolling strategies of the
defender and then takes action. The cost for the PA is the distance travelled from the initial
position to the final target and the time required to tap into the pipeline, which depends on
multiple domain variables such as the condition of the pipeline segment (a very old pipeline is
easier to tap), the geographic region, etc. Zpa is the maximum punishment, which depends on the
law of the country and is usually high. In the case of the PA, the chances of getting caught are
high since she is physically present at the crime site. The reward and penalty corresponding to
a physical target j are:
Rjpa = int(Vj) – int ((Dj + Tj) * ) (3)
\[ P_j^{pa} = -Z^{pa} \tag{4} \]
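To make the physical attacker's trade-off in (3) and (4) concrete, the following Python sketch evaluates the reward for two pipeline segments. It is an illustration under stated assumptions, not the author's implementation: `domain_factor` is a hypothetical stand-in for the domain parameter, the normalization int(·) is taken to be division by the maximum over all segments (following the nomenclature's example for distance), and all segment names and numbers are invented.

```python
def normalize(value, max_value):
    """int(.) from the nomenclature: divide a value by its maximum."""
    return value / max_value


def pa_reward(volume, distance, tap_time, domain_factor, max_volume, max_effort):
    """Equation (3): normalized daily volume minus normalized effort.

    domain_factor is a hypothetical stand-in for the domain parameter."""
    effort = (distance + tap_time) * domain_factor
    return normalize(volume, max_volume) - normalize(effort, max_effort)


def pa_penalty(max_punishment):
    """Equation (4): the penalty is the negated maximum punishment."""
    return -max_punishment


# Hypothetical pipeline segments (illustrative numbers only).
segments = {
    "S1": {"volume": 1000.0, "distance": 20.0, "tap_time": 5.0},
    "S2": {"volume": 600.0, "distance": 5.0, "tap_time": 5.0},
}
DOMAIN_FACTOR = 1.0
max_volume = max(s["volume"] for s in segments.values())
max_effort = max(
    (s["distance"] + s["tap_time"]) * DOMAIN_FACTOR for s in segments.values()
)

rewards = {
    name: pa_reward(s["volume"], s["distance"], s["tap_time"],
                    DOMAIN_FACTOR, max_volume, max_effort)
    for name, s in segments.items()
}
most_attractive = max(rewards, key=rewards.get)
print(rewards, most_attractive)
```

With these made-up numbers the nearer, lower-volume segment S2 is the more attractive target, because the effort term cancels the volume advantage of S1.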
Table 7.2. Rewards Table.

Cyber Rewards                When (x = i)    When (x != i)
Defender: $R_{x,i}^{dc}$     $R_x^{dc}$      $P_i^{dc}$
CA: $C_{x,i}^{ca}$           $P_i^{ca}$      $R_i^{ca}$

Physical Rewards             When (y = j)    When (y != j)
Defender: $R_{y,j}^{dp}$     $R_y^{dp}$      $P_j^{dp}$
PA: $C_{y,j}^{pa}$           $P_j^{pa}$      $R_j^{pa}$
Defender. The defender receives a reward or penalty for protecting both the cyber and
physical targets. Although the PA steals only some amount of oil from the pipeline, the impact
on the system for the defender is the average amount of oil flowing through that pipeline daily (8).
Once the PA taps into a pipeline, it is costly for the defender to repair the pipeline and sustain
oil flow at appropriate physical parameters. The reward for protecting the CT is the percentage of
risk mitigated (5) by applying a particular cyber security control z to CT i. In both
cases, some cost is incurred (discussed in sub-section 7.1.2). We can neglect the cost in the
case of the PT since the average amount of oil flowing is very high. In equation (5), the CT is
correlated with PT t, which is why we use Vt to represent the volume of oil. The rewards
and penalties corresponding to a cyber target i and a physical target j are:
\[ R_i^{dc} = \mathrm{int}\left(V_t \, MT_{i,z} - C_z\right) \tag{5} \]
\[ P_i^{dc} = -\,\mathrm{int}(V_t) \tag{6} \]
Rjdp = int (Vj – Dj * ) (7)
\[ P_j^{dp} = -\,\mathrm{int}(V_j) \tag{8} \]
Game Rewards. When the defender is covering a target and the attacker is trying to
compromise the same target, the defender receives the reward and the attacker receives the
penalty corresponding to that target. On the other hand, if the defender is not protecting the
target which is being attacked, he receives a penalty and the attacker receives the reward
corresponding to that target. Rewards and penalties used in the game are shown in Table 7.2.
(Note: x, i ∈ TC; y, j ∈ TP.)
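The rule in Table 7.2 translates directly into a pair of payoff matrices. The Python sketch below builds them for the cyber side; the physical side is analogous. The function name `build_payoffs`, the target names, and all numeric values are hypothetical and serve only to show the covered/uncovered case split.

```python
def build_payoffs(def_reward, def_penalty, att_reward, att_penalty):
    """Build R[x][i] (defender) and C[x][i] (attacker) following Table 7.2.

    x is the target the defender covers, i is the target the attacker hits."""
    targets = sorted(def_reward)
    R = {x: {} for x in targets}
    C = {x: {} for x in targets}
    for x in targets:
        for i in targets:
            covered = (x == i)
            # Covered: defender gets his reward, attacker gets his penalty.
            # Uncovered: defender gets the penalty of the attacked target,
            # attacker gets the reward of the attacked target.
            R[x][i] = def_reward[x] if covered else def_penalty[i]
            C[x][i] = att_penalty[i] if covered else att_reward[i]
    return R, C


# Hypothetical, already-normalized values for two cyber targets.
R, C = build_payoffs(
    def_reward={"hmi": 0.4, "plc": 0.7},
    def_penalty={"hmi": -0.5, "plc": -1.0},
    att_reward={"hmi": 0.3, "plc": 0.8},
    att_penalty={"hmi": -1.0, "plc": -1.0},
)
print(R)
print(C)
```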
7.2.3 Computing Optimal Strategies
There are two matrices to represent the games between defender and CA and between defender
and PA (similar to Table 7.1). The matrices represent the pure strategies, rewards, and penalties
of the players in the game. The rows of the matrices represent the pure strategies of the defender.
A pure strategy of the defender for the CT could be to deploy a firewall, verify the software
installed on meters and PLCs, install IDS/IPS, perform VA, etc. For the PT, it is the choice of
pipeline location at which to deploy patrol teams. The columns of the matrices represent the pure
strategies of the attackers. For the cyber attacker, the strategy could be to compromise a
particular server, database, control system, HMI, etc.
Finally, for the PA, it is to attack a particular pipeline segment region. The attackers
have to choose some set of targets to attack which have maximum impact on the system and
give them maximum rewards, and the defender has to pick some targets, according to the
number of resources given to him, which minimize the impact on the system (or, equivalently,
maximize its protection). Now we discuss the algorithm to compute optimal strategies of the
players when the attackers: 1) do not coordinate and 2) do coordinate.
No Coordination Among Attackers. When the attackers are not coordinating, the
defender maximizes his utility separately for the two types of targets. Since it is an SSG, the
defender chooses an optimal strategy X, which can be observed by the CA, and then the CA chooses
his strategy C, ultimately breaking ties in favor of the defender. The MIQP for the CA is:
\[ \max_{C} \; \sum_{i \in C} \sum_{x \in X} C_{x,i}^{ca}\, m_x\, c_i \tag{13} \]
\[ \text{s.t.}\quad \sum_{i \in C} c_i = 1, \qquad c_i \ge 0 \]
After evaluating the dual of this MIQP, the complementary slackness (CS) condition, which gives
the maximum utility ($a_{nc}$) of the CA, is $c_i \left(a_{nc} - \sum_{x \in X} C_{x,i}^{ca}\, m_x\right) = 0 \;\; \forall i \in C$.
The defender’s objective function, considering the maximum utility of the CA, is:
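\[ \max_{X} \; \sum_{x \in X} \sum_{i \in C} R_{x,i}^{dc}\, m_x\, c_i \tag{14} \]
\[ \text{s.t.}\quad \sum_{i \in C} c_i = 1 \]
\[ \sum_{x \in X} m_x = 1 \]
\[ 0 \le a_{nc} - \sum_{x \in X} C_{x,i}^{ca}\, m_x \le (1 - c_i)\, M \tag{15} \]
\[ m_x \in [0, 1], \qquad c_i \in \{0, 1\}, \qquad a_{nc} \in \mathbb{R} \]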
In the above MIQP, the defender evaluates all the possible ways to protect the CTs and
chooses an optimal strategy considering the best response of the attacker (15). Similarly, the PA
and defender interaction can be represented in the form of a simple MIQP where the defender
maximizes his utility by allocating patrol teams (Y) to the pipeline locations such that the PA
breaks ties (P) in favor of the defender (17). The defender’s objective function, considering the
maximum utility of the PA, is:
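\[ \max_{Y} \; \sum_{y \in Y} \sum_{j \in P} R_{y,j}^{dp}\, n_y\, p_j \tag{16} \]
\[ \text{s.t.}\quad \sum_{j \in P} p_j = 1 \]
\[ \sum_{y \in Y} n_y = 1 \]
\[ 0 \le b_{nc} - \sum_{y \in Y} C_{y,j}^{pa}\, n_y \le (1 - p_j)\, M \tag{17} \]
\[ n_y \in [0, 1], \qquad p_j \in \{0, 1\}, \qquad b_{nc} \in \mathbb{R} \]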
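The structure of these programs can be illustrated without a MIQP solver. The sketch below is a brute-force substitute, not the MIQP itself: for a defender with two pure strategies it scans a grid of mixed strategies, lets the attacker pick a best pure response (the role played by the best-response constraint), breaks ties in the defender's favor, and keeps the commitment with the highest defender utility. The matrices reuse the Table 7.1 payoffs as an assumed example; `R`, `C`, `best_response`, and `solve_two_row_game` are illustrative names.

```python
from fractions import Fraction

# R[x][i]: defender payoff, C[x][i]: attacker payoff (values from Table 7.1).
R = [[10, 9], [5, 8]]
C = [[5, 6], [8, 3]]


def expected(matrix, m, i):
    """Expected payoff of attacking target i against mixed strategy m."""
    return sum(m[x] * matrix[x][i] for x in range(len(m)))


def best_response(m):
    """Attacker's best pure response to m; ties favor the defender."""
    return max(range(len(C[0])),
               key=lambda i: (expected(C, m, i), expected(R, m, i)))


def solve_two_row_game(steps=60):
    """Grid search over the probability of playing the first defender row."""
    best = None
    for k in range(steps + 1):
        prob = Fraction(k, steps)
        m = [prob, 1 - prob]
        i = best_response(m)
        utility = expected(R, m, i)
        if best is None or utility > best[0]:
            best = (utility, m, i)
    return best


utility, mixed_strategy, attacked = solve_two_row_game()
print(utility, mixed_strategy, attacked)
```

On this example the search finds that playing the first row with probability 5/6 gives the defender 55/6, slightly more than the best pure commitment, which is why the programs above optimize over mixed strategies. Exact fractions are used so that the tie at 5/6 is detected exactly; a real instance with many targets would need a proper solver.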
In this way, the defender is able to maximize his overall utility for both types of targets
separately. He does not have to consider coordination among the attackers. The problem
becomes complicated when the attackers are coordinating and the defender wants to break the
cooperation and maximize his own utility.
Coordination Among Attackers. In sub-section 7.1.1, we discussed the motivation of
both attackers to coordinate. Since the CA knows the correlation between the cyber and physical
infrastructure, he compromises the cyber nodes that he finds attractive and that have some
physical correspondence, and he informs the PA to attack the corresponding physical node.
Without coordination, the PA can steal only a fraction of the oil Vj flowing through PT j. This
fraction is the percentage of oil he can steal on the basis of his own ability, which we can vary
in our game by assuming some probability distribution. The remaining amount of oil is the
complementary fraction of Vj. A second parameter gives the percentage of extra oil the PA can
steal from a PT based on the ability of the CA to help the PA by giving additional information
about the physical process and by controlling the PT from the cyber domain. In sub-section
7.1.3, we discussed the different types of CA, who have varying levels of knowledge and
experience to perform attacks. The cyber attacker knows whether there is a correlation between
the cyber and physical target (cij, a Bernoulli variable). We can vary these values in the model.
The extra amount of oil that can be stolen is the remaining amount of oil scaled by cij and by the
CA's ability parameter. The PA keeps a share of this extra oil and gives the rest to the CA. The
PA's share depends on the agreement between the CA and PA, which in turn depends on the CA's
ability. It always lies between 0.5 and 1, because the PA, who is physically present at the site,
is always more likely to be caught than the CA. The coordination rewards of the CA and PA are
given by equations (25) and (26). (Note: to solve the game, fix the ability parameters for a
specific type of CA and PA, respectively, and the sharing parameter.)
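The split of the extra oil described above can be written out as a small calculation. This Python sketch is one reading of the paragraph, offered under explicit assumptions: the parameter names `pa_ability`, `ca_ability`, and `pa_share` are hypothetical labels for the three parameters, the extra oil is taken to be the product of the correlation indicator, the CA's ability, and the oil the PA cannot reach alone, and the numbers are invented.

```python
def coordination_rewards(volume, correlated, pa_ability, ca_ability, pa_share):
    """Return (PA share, CA share) of the extra oil gained by coordinating.

    volume      -- average daily oil through the physical target (Vj)
    correlated  -- True if the cyber target controls this physical target (cij)
    pa_ability  -- fraction of Vj the PA can steal alone (hypothetical name)
    ca_ability  -- fraction of the remainder the CA can unlock (hypothetical name)
    pa_share    -- fraction of the extra oil kept by the PA, in [0.5, 1]
    """
    if not 0.5 <= pa_share <= 1:
        raise ValueError("pa_share must lie between 0.5 and 1")
    remaining = (1 - pa_ability) * volume
    extra = (1 if correlated else 0) * ca_ability * remaining
    return pa_share * extra, (1 - pa_share) * extra


pa_extra, ca_extra = coordination_rewards(
    volume=1000.0, correlated=True, pa_ability=0.25, ca_ability=0.5, pa_share=0.75
)
print(pa_extra, ca_extra)
```

With no correlation between the two targets both shares are zero, so the attackers gain nothing by coordinating on that pair.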
Mixed Integer Quadratic Program
maxX, Y xX yY i C j P (Rx, idc mx ci + Ry, jdp ny pj – (1 - mx) * CRica – (1 - ny)
* CRjpa (18)
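\[ \text{s.t.}\quad \sum_{j \in P} p_j = 1 \tag{19} \]
\[ \sum_{y \in Y} n_y = 1 \tag{20} \]
\[ \sum_{i \in C} c_i = 1 \tag{21} \]
\[ \sum_{x \in X} m_x = 1 \tag{22} \]
\[ 0 \le a_c - \sum_{x \in X} \left(C_{x,i}^{ca} + CR_i^{ca}\right) m_x \le (1 - c_i)\, M \tag{23} \]
\[ 0 \le b_c - \sum_{y \in Y} \left(C_{y,j}^{pa} + CR_j^{pa}\right) n_y \le (1 - p_j)\, M \tag{24} \]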
PA: CRjpa = cij (1 - ) Vj (25)
CA: CRica = cij (1 - ) (1 - ) Vj (26)
\[ 0 \le a_{nc} - \sum_{x \in X} C_{x,i}^{ca}\, m_x \le (1 - c_i)\, M \tag{27} \]
\[ 0 \le b_{nc} - \sum_{y \in Y} C_{y,j}^{pa}\, n_y \le (1 - p_j)\, M \tag{28} \]
- M<= ac - anc <= (1 - ) M (29)
- M<= bc - bnc <= (1 - ) M (30)
<= 1 (31)
, , mx, ny [0…1] (32)
cij, , , pj, ci {0,1} (33)
ac, bc, anc, bnc (34)
[0.5 … 1] (35)
Now we need to add the condition of coordination among the attackers to our game. In
the real world, attackers evaluate the benefit they receive from coordinating and from not
coordinating, and then they take actions. We have modeled this behavior by adding (29) and (30)
to the final MIQP of the defender, where both attackers evaluate the value of coordination by
comparing it with the situation in which they do not coordinate. Equations (23) and (24)
represent the complementary slackness conditions, which describe the maximum utility of both
attackers when they are coordinating. The objective is to maximize the utility function (18),
which includes the rewards that the defender receives for protecting the CTs and PTs. The MIQP
computes the defender’s optimal mixed strategies X and Y, which maximize the reward function
whether or not the attackers coordinate.
In the objective function (18), we have discounted the coordinated rewards of both attackers,
which they obtain if they coordinate and if those targets are left unprotected with some
probability. Constraint (31) represents that either both attackers coordinate or neither does.
We can use an optimization solver such as CPLEX to solve the MIQP and compute the mixed
strategies of the players.
Resource Allocation. The defender needs to assign the cyber controls and patrol teams
to the CT and PT respectively, based on the mixed strategies he obtained from the above game.
Once the defender has the probability distribution over all pure strategies (the mixed strategy),
he knows which targets are important. Note that both attackers observe
the mixed strategies of the defender. The defender therefore has to allocate resources such that
the attackers cannot figure out which target is given more importance, making the attackers
indifferent among the targets. To do so, the defender randomly chooses targets according to the
probabilities assigned to the pure strategies. It is also possible that the defender protects
specific targets at all times, irrespective of the probability distribution, because they are
central to the functionality of the system. It is advisable to take recommendations from the oil
system engineers and system administrator to add additional constraints to the game or to change
the set of pure strategies according to real-world requirements.
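The randomized assignment described above amounts to sampling a pure strategy from the mixed strategy, plus a fixed set of targets that are always covered. The following Python sketch shows one way to do this, assuming a single patrol resource per draw; the function name `sample_allocation`, the target names, and the probabilities are hypothetical.

```python
import random


def sample_allocation(coverage, always_covered=(), rng=None):
    """Draw one target according to the mixed strategy `coverage`.

    coverage       -- dict mapping target name to coverage probability
    always_covered -- targets protected regardless of the draw
    rng            -- optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    targets = list(coverage)
    weights = [coverage[t] for t in targets]
    sampled = rng.choices(targets, weights=weights, k=1)[0]
    return set(always_covered) | {sampled}


# Hypothetical mixed strategy over three pipeline segments.
coverage = {"segment_A": 0.5, "segment_B": 0.3, "segment_C": 0.2}
today = sample_allocation(coverage, always_covered={"compressor_station"},
                          rng=random.Random(7))
print(today)
```

Drawing a fresh sample for each patrol period keeps the long-run coverage frequencies equal to the mixed strategy while keeping any single day's assignment unpredictable to an observer.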
Challenges. This algorithm works in the pure strategy space; thus, it evaluates all
possible choices of the defender and attackers, and the number of targets may become large.
To overcome this issue, we can merge sets of targets
which have similar characteristics and impact on the system. For example, system engineers
can treat the sensor network as one target, a pipeline region of 10 miles as one target,
etc. In future experiments, we will divide the pipeline segments into regions based on the ability
of the patrol team to cover the target and on geographic boundaries. Since the pipeline system is
a massive system which cannot be controlled from one place, it is advisable to implement such
algorithms for a small geographic area and not at the county or state level, because the
geographic terrain changes as the area of consideration grows, and a large-area model ignores
the features which are particular to a region of the pipeline. Every SCADA system which monitors
the pipeline segment in a particular area should use this algorithm and incorporate features
(discussed in sub-section 7.1.3) according to its region to get optimal results.
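Merging targets to shrink the strategy space can be as simple as bucketing pipeline locations into fixed-length regions. The sketch below assumes locations are given as mile markers and uses the 10-mile region size from the example above; the function name `merge_into_regions` and the sample markers are hypothetical.

```python
def merge_into_regions(mile_markers, region_miles=10):
    """Group pipeline mile markers into consecutive regions of fixed length.

    Each region is then treated as a single target in the game."""
    regions = {}
    for mile in mile_markers:
        region_index = int(mile // region_miles)
        regions.setdefault(region_index, []).append(mile)
    return regions


regions = merge_into_regions([0, 3, 9.5, 10, 27])
print(regions)
```

A grouping based on patrol reach or geographic boundaries, as proposed for future experiments, would replace the fixed region length with a domain-specific rule.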
7.3 Theorems
7.3.1 Theorem 1
The solution to the game is a Stackelberg equilibrium.
PROOF. We need to show two things to prove that the solution to the game is an
equilibrium: 1) the cyber and physical attackers play in equilibrium, which means they do not
have any incentive to deviate from their equilibrium strategies, and 2) they both break ties in
favor of the defender (maximum utility).
Suppose CA plays (ce, e) and PA plays (pe, e) in equilibrium. And suppose (c, ) of
CA and (p, ) of PA are any other actions they can play. Utility after playing (ce, e) and (pe,
e) (Ue) should be greater than any other utility for playing (c, ) and (p, ). Now we have two
cases, first when e = & e = and second is when they are not equal. In the first case, when
the actions taken at equilibrium and otherwise are the same, both players get the maximum
utility, since equations (23)-(24) and (27)-(28) compute maximum utilities. Therefore, utilities at
equilibrium are no less than any other utility, whatever actions the attackers play. In
the second case, when the actions taken at equilibrium and otherwise differ, each attacker
compares the utilities obtained in both cases using equations (29) and (30) and then takes the
action which gives the maximum utility. Now we need to prove that the attackers break ties in
favor of the defender, which is the condition for a Strong Stackelberg equilibrium. Whenever the
attackers compute their utilities, they take the defender's strategy into account, since
they know the mixed strategies mx and ny of the defender. The objective function (18), in turn,
maximizes the defender’s utility. Therefore, both attackers will break ties in favor of
the defender.
7.3.2 Theorem 2
The defender is able to break the cooperation between the cyber and physical attackers
using her optimal mixed strategies from the game, by allocating resources such that at least one
of the attackers gets more utility from not coordinating than from coordinating.
PROOF. If the defender allocates the resources in such a way that either of the attackers
gets a higher utility from not coordinating, that attacker will not coordinate, thereby breaking
the coordination. For instance, the defender can trick the PA into believing that a particular
area of a pipeline is not monitored frequently, so that the PA can tap into that pipeline very
quickly and for a long time. In this case, the PA will not coordinate with the CA, since the PA
can extract a significant amount of oil without the CA's help, the PA's chances of getting caught
appear to be low, and the PA does not have to share any portion of the oil with the CA. This
breaks the coordination from the physical side. We cannot ignore the fact that it is difficult
for the defender to break the coordination from the cyber side, because the primary motive of
the CA is to compromise the cyber system and to reduce its operational resilience; if given a
chance, she can help the PA for some extra rewards. The conditions for no cooperation among
attackers are:
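\[ a_c \le a_{nc} \quad \text{or} \quad b_c \le b_{nc} \]
Using equations (23) and (27), we can say that:
\[ a_{nc} \ge \sum_{x \in X} C_{x,i}^{ca}\, m_x \]
\[ a_c \ge \sum_{x \in X} \left(C_{x,i}^{ca} + CR_i^{ca}\right) m_x \]
Subtracting the two (which hold with equality for the attacked target), we get
\[ a_{nc} \ge a_c - \sum_{x \in X} CR_i^{ca}\, m_x \]
Similarly, using (24) and (28), we get
\[ b_{nc} \ge b_c - \sum_{y \in Y} CR_j^{pa}\, n_y \]
Adding the above inequalities, we get
\[ a_c + b_c \le a_{nc} + b_{nc} + \sum_{x \in X} CR_i^{ca}\, m_x + \sum_{y \in Y} CR_j^{pa}\, n_y \]
For breaking the coordination, with either $a_c = a_{nc}$ or $b_c = b_{nc}$, we get
\[ a_c - a_{nc} \le \sum_{x \in X} CR_i^{ca}\, m_x + \sum_{y \in Y} CR_j^{pa}\, n_y \quad \text{or} \]
\[ b_c - b_{nc} \le \sum_{x \in X} CR_i^{ca}\, m_x + \sum_{y \in Y} CR_j^{pa}\, n_y \]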
If the defender allocates resources so that the difference between an attacker's maximum
utilities with and without coordination does not exceed the sum of the coordinated rewards
of both attackers, the attackers will not coordinate with each other.
7.4 Conclusion
The first contribution of this chapter is a theoretical model of oil theft using
a game-theoretic approach. This chapter provides a detailed description of the OPCPS by
defining the motivation of the players and by describing the cyber and physical
targets and the challenges faced by the players. The second contribution lies in the modeling of
the coordination between two distinct attackers who have different motives, attack two different
types of targets (cyber and physical), and coordinate to maximize their rewards. The
coordinated rewards that the cyber and physical attackers receive depend on their ability to
perform an attack and on the correlation between CT and PT. We introduced parameters that define
these conditions; the system admin can vary these parameters according to different requirements.
Such a coordination formulation can be extended to other problem domains. We have also
defined the reward and penalty equations for each player and each target.
This work identifies the space of correlation between cyber and physical infrastructure in the
context of the OPCPS. It not only assists the oil pipeline system administrator in allocating
resources efficiently but also provides a detailed understanding of which factors to consider
when modeling the problem of resource allocation in CPSs. This formulation can be extended to
model the problem of gas pipelines or problems in other related fields.
Chapter 8
Case Studies
In this chapter, our main aim is to show how real-world blackouts could have
originated from cyber attacks on the power grid, taking the US & Canada and
India blackouts as examples. Moreover, we show the application of the governor system
(discussed in Chapter 5) in the field of autonomous cars, and the application of the end-to-end
risk assessment methodology (discussed in Chapter 4) and the attack-defense approach (discussed
in Chapter 3) in the oil & gas pipeline system.
8.1 Historical Blackouts and Attacks
In this section, we briefly describe some real-world historical blackouts that could
have originated from cyber attacks on the power grid. For each scenario, we follow two steps:
1) discuss the real-world scenario, and 2) discuss how it could have
originated from cyber attacks. The attacks are classified into two categories: C1: cyber
attacks for disabling cyber functions of the power grid, and C2: cyber attacks for creating
a power demand-supply mismatch.
8.1.1 2003 Blackout in the US and Canada
The US-Canada blackout in 2003 is one of the largest blackouts in history,
affecting almost 50 million people. The failure of a generator (East Lake Unit 5) in
northern Ohio led to this blackout. The generator tripped because power engineers were
not able to predict the reactive load requirements on that hot day. Power engineers perform
real-time contingency analysis (RTCA) to simulate various contingency situations and
outages in order to evaluate the reliability of the power system. RTCA tools help operators
understand the system dynamics and discover whether the system is operating securely. On the day
of the blackout, the state estimator of the Midwest Independent System Operator (MISO) did not
reflect the actual status of the system. Cinergy's Bloomington-Denois Creek 230-kV line was out
of service, but the state estimator showed it as working. The operators tried to correct the
state, but they were not successful. The system operators were not able to perform contingency
analysis because of incomplete situational awareness. Since there was no contingency plan for the
unpredicted reactive load in the northern region of Ohio, the generator failed, causing cascading
failures. The leading causes of the blackout were inadequate situational awareness and
inadequate system understanding. For more details, refer to [123].
How could this blackout have originated from cyber attacks on the power grid? The
attacker could have compromised (C1) the control system that maintains state awareness,
compromised the RTCA tool so that it performed incorrect contingency analysis, or performed a
DoS attack or false data injection attack on the state estimator to prevent correct readings
from reaching the RTCA tool. In all these cases, the information displayed on the HMI is
incorrect and misleading, and it is therefore difficult for engineers to take relevant action
during the attack.
Another set of possible attacks is (C2): the attackers could have compromised the DRAS
and sent malicious load shedding commands to create an unpredicted load drop, causing the
frequency to go beyond the secure threshold. The attackers could have compromised the DER
devices and disconnected them from the grid to prevent power dispatch during peak demand
hours, causing a demand-supply mismatch as shown in [28], or performed a pressure integrity
attack on the natural gas distribution pipeline [28] to prevent gas from reaching peaker plants
and generating power during peak hours. All these attacks create an imbalance between power
demand and supply, due to which the frequency of the grid goes beyond the threshold (under- and
over-frequency protection), resulting in generator shutdown. This might have been the cause of
the failed generator in the northern region of Ohio that resulted in the US-Canada blackout.
8.1.2 2012 Blackout in India
India suffered two significant blackouts in July 2012. The outage affected more than
620 million people (half of India’s population), making it the largest power outage in
history. The cause of the power outage was the tripping of circuit breakers on the Bina-Gwalior
line, which fed into the Agra-Bareilly transmission section. Once it tripped, the
breakers at the Agra-Bareilly section also tripped, and that led to cascading failures throughout
the grid. All major power stations shut down, causing an estimated shortage of 32 GW of
power. The power outage shut down railways, airports, traffic
signals, and water treatment plants (leaving millions who relied on electric water
pumps without clean water), and even hospitals reported interruptions in health services. For
more details, refer to [124].
How could this blackout have originated from cyber attacks on the power grid? A
circuit breaker connects a generator to the grid and protects the electric circuit from
overloading and short circuits. The cause of the power outage in India was the tripping of
circuit breakers on the Bina-Gwalior line. This could have been initiated by an attacker who
compromised the cyber system that sets the configuration of the relays which prevent circuit
breakers from tripping the line. Malicious signals sent from the cyber domain (C1) to the
breakers would trip them, resulting in failed generators and, in turn, cascading failures.
Figure 8.1: Example of architecture of in-vehicle network. This figure is borrowed from
[118].
8.2 Governor for Autonomous Cars
In this section, we describe the application of the Intelligent Governor to autonomous
cars. We briefly define the threat model, the set of malicious actions taken by an adversary, and
finally the governor functionality.
Threat Model. An adversary controls the Electronic Control Units (ECUs) [118] of a
car through physical or remote access [119,120,121]. With physical access, an adversary who has
momentary access to the car installs a malicious component into the in-vehicle network
(shown in Figure 8.1) via the ubiquitous On-Board Diagnostic (OBD-II) port. The malicious
component is programmed to disconnect various car functions when doing so is neither required
nor safe. The adversary learns the packet structure by fuzzing packets on the Controller Area
Network (CAN) [120,121] with a car diagnostic tool. Then, she installs those manually fuzzed
commands into the malicious component to trigger attacks when the impact is high, by
leveraging information from the vehicle’s different components.
Another vector to access ECUs is through a wireless interface (both short and long
range) implemented in an autonomous vehicle. An adversary sends malicious signals to ECUs
remotely over the internet to control the components’ behavior. The commands sent over the
internet, using offboard WiFi, VANET, or a cellular network, enter the in-vehicle network through
the V2X head unit (shown in Figure 8.1). An adversary may perform malicious actions to
control: 1) brakes, 2) door locking, 3) windshield fluid shoot, 4) horn and its frequency, 5) pop
trunk, 6) engine functions, etc.
Governor for Autonomous Cars. We define an intelligent governor to prevent attacks
that are mounted over the internet to control various car functions. It does not prevent attacks
that are mounted from inside, such as malicious software installed onboard to fuzz packets on the
in-vehicle network.
The governor should be placed adjacent to the V2X head unit. It gathers information about
the state of the car, such as whether it is moving, at what speed, whether the doors are locked,
etc., from the onboard ECUs. For instance, if we install a governor in a Mercedes Benz model, it
interacts with the PreSafe system [122] to gain information about collision detection and
decide whether to apply the brakes. The governor understands the packets generated by the V2X
head unit, and it knows which ECU each packet is addressed to. Packets are in CAN format with an
ID. The governor understands what a packet is trying to achieve and accordingly evaluates the
impact on the car if the packet is allowed to go through. Therefore, a governor needs to be
implemented so that it has the inherent capability to understand the packets flowing in the
in-vehicle network and the commands coming from the external network.
When an adversary issues a malicious door locking, door unlocking, or brake command from
the external network, the governor captures it before the V2X Headunit places it on the
onboard network. In the absence of the governor, the V2X Headunit places that command on the
network using the CAN protocol; the Body Controller then captures it and commands the door
actuators to lock or unlock irrespective of the state of the car. When the governor is
present, it evaluates whether each command is required and safe for the car. If the car is
moving, it disallows the unlocking command. If the car is not moving, it allows the unlocking
command. It disallows a malicious command to disable the brakes, since that command is neither
required nor safe for the car. For a malicious command to apply the brakes, it interacts with
the onboard safety system to decide whether braking is required and safe, and then allows or
disallows the command. It disallows the pop trunk command when the car is moving. The governor
disallows any command that manipulates the engine control unit. In this way, the car governor
prevents cyber-physical attacks that originate remotely to control a car’s ECUs. For more
sophisticated attacks that are equally likely, such as fuzzing packets via malware, we can
place a governor on the ECUs (edge devices) so that each ECU decides whether a command is
required and safe before executing it. As part of future work, we need to look at how this
should be done and whether it is efficient and effective.
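The governor rules described above can be sketched as a small decision function. This is only an illustrative sketch, not the implementation used in this work: the command names, the CarState fields, and the default-deny behavior for unknown commands are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class CarState:
    """Snapshot of the car state gathered from the onboard ECUs (hypothetical fields)."""
    moving: bool
    braking_needed: bool  # verdict of the onboard safety system, assumed to be available


def governor_allows(command: str, state: CarState) -> bool:
    """Return True if an external command may be placed on the in-vehicle network."""
    if command in ("unlock_doors", "pop_trunk"):
        # Unlocking and opening the trunk are only allowed while the car is stationary.
        return not state.moving
    if command == "apply_brakes":
        # Defer to the onboard safety system for braking decisions.
        return state.braking_needed
    if command == "disable_brakes" or command.startswith("engine_"):
        # Never required and never safe when issued from the external network.
        return False
    # Assumption: any command the governor does not recognize is denied.
    return False


parked = CarState(moving=False, braking_needed=False)
driving = CarState(moving=True, braking_needed=False)
decisions = {
    "unlock while parked": governor_allows("unlock_doors", parked),
    "unlock while driving": governor_allows("unlock_doors", driving),
    "disable brakes": governor_allows("disable_brakes", driving),
}
```

Keeping the policy as a pure function of the command and the observed state makes each rule easy to audit and test in isolation.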
8.3 Application of Attack-Defense Approach
A group of adversaries performs multiple attacks on the gas pipeline distribution
system. First, they attack the control system to take control of the pump station. The pump
station is responsible for pushing the gas forward for distribution to customers. Once it is
compromised, the gas stops flowing from the utility. The system admin instructs the storage
facility to compensate for the loss. Next, the attackers perform a cyber or physical attack on
the storage facility to obstruct its functionality. In response, the defender sends control
signals from the SCADA system to control the physical process and mitigate the attack. The
adversary then performs a DDoS attack on the communication infrastructure to slow the
response of the system and prevent gas from flowing forward. Using this approach, we
discover the various factors that should be considered while developing secure
systems.
8.4 Application of End-to-End Risk Assessment Methodology
First, derive the cyber domain network of the oil & gas pipeline system and construct a
Bayesian Network. Second, identify the importance of each function and its influence on the
physical system. In other words, understand which physical functionality is manipulated if a
function is compromised. For instance, the SCADA system controls the pressure of the gas
flowing through a pipeline as it sends pressure commands to the compressor stations. Third,
perform a vulnerability assessment to discover known vulnerabilities and their probability of
exploitability. Fourth, use the computed exploitability probability and the Bayesian Network to
compute the likelihood of compromising every function. Fifth, identify the types of
cyber-physical attacks possible from each function and simulate them in the pipeline
simulator to understand to what extent the system is resilient in the presence
of an attack. Then, quantify the impact in terms of the physical properties of the system, for
instance, the pressure of the gas. Finally, compute risk using the likelihood and impact
values. System engineers can then see what contributes more to the risk, the likelihood or the
impact, and decide accordingly how to reduce it. In Chapter 7, we presented a game-theoretic
approach to decide how to allocate resources in the cyber and physical domains of the
oil pipeline system in order to prevent oil stealing.
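The last step of the methodology, combining likelihood and impact into risk, can be sketched as follows. The function names and all numeric values are made up for illustration; in the methodology the likelihood would come from the Bayesian Network and the impact from the pipeline simulation.

```python
def risk(likelihood: float, impact: float) -> float:
    """Risk as the likelihood of compromise times the physical impact."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be a probability in [0, 1]")
    return likelihood * impact


# Hypothetical cyber functions: (likelihood of compromise, impact in arbitrary units).
functions = {
    "scada": (0.30, 8.0),
    "hmi": (0.60, 2.0),
    "historian": (0.10, 1.0),
}

risk_by_function = {name: risk(p, i) for name, (p, i) in functions.items()}

# Rank functions from highest to lowest risk to prioritize mitigation.
ranked = sorted(risk_by_function, key=risk_by_function.get, reverse=True)
```

In this made-up example the SCADA function ranks first even though it is not the most likely to be compromised, because its impact dominates. This is the kind of likelihood-versus-impact reading the text attributes to system engineers.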
Chapter 9
Thesis Discussion
My thesis statement asks: How does the manipulation of distinct functional components
affect smart grid resilience? How should we allocate resources to maintain resilience? In this
chapter, we discuss how the end-to-end risk analysis methodology is useful in understanding
the impact of manipulating distinct functional components of the grid. Moreover, we discuss
how the power protection framework, game-theoretic, and reinforcement learning approaches
assist power engineers in allocating resources to maintain the resilience of the system. Finally,
we discuss the limitations of the overall research.
9.1 Thesis Discussion
In this dissertation, we presented the end-to-end risk management approach to defend
the smart grid, an industrial control system, from cyber-physical attacks. Although extensive
work has been done in the literature to evaluate the risk of an attack, most of it does not
consider the complete definition of risk (that is, the likelihood of an attack times its impact).
Existing approaches demonstrate how resilient the grid is against attacks and claim that they
assist power engineers in identifying factors that should be considered while developing robust
systems. In reality, they do not perform a complete risk assessment and do not assist power
engineers in deciding what actions to take to reduce risk.
In this dissertation, we answer these questions by first describing the end-to-end risk
assessment methodology and demonstrating how to model the cyber and physical system to
quantify risk. We support our methodology by showing how to compute risk for manipulating
circuit breakers through the Electricity Control Center in the cyber domain in the presence and
absence of a Photo Voltaic unit. The end-to-end risk analysis is repeatable and can be used
to compute the risk of an attack in the grid or other industrial control systems. Table 9.1
shows the contributions to risk assessment approaches in the smart grid and oil & gas pipeline
systems. We used a variety of approaches to evaluate resilience in the presence of multiple
cyber-physical attacks and to quantify the risk of a cyber-physical attack.
Second, we evaluated the resilience of the grid in the presence of multiple cyber-
physical attacks and showed how real-world attacks happen using an Attack-Defense approach.
Here, we broaden the areas (functional components) considered in the risk analysis to include
DER units, AMI worm propagation, and peaker power plants. We use the function-
based methodology [24] to simplify and abstract the process of evaluating the
resilience of both the grid and oil & gas pipeline systems.
Third, we extended our risk analysis to assist power engineers in deciding on how to
reduce risk (second step of risk management). In order to reduce risk, there are two options:
1. Reduce the likelihood of an attack on the cyber function of the system.
2. Reduce the impact of an attack.
a. Deploy resources in the physical system
b. Deploy resources in the cyber system
Table 9.2 shows the contributions to risk mitigation approaches in the smart grid and oil
& gas pipeline systems.
Table 9.1. Contributions to Risk Assessment.
| System | Approach | Attacks | Motive | Resilience Stage | Paper Index |
|---|---|---|---|---|---|
| Smart Grid | Attack-Defense | Worm propagation over AMI to control DERs; attack on gas pipeline | Evaluate resilience | - | [27], [28] |
| Smart Grid | End-to-End Risk Assessment Methodology | Manipulating circuit breakers via ECC | Quantify risk | Avoidance | [32] |
| Smart Grid | Bayesian Network | - | Compute the likelihood of cyber system compromise | Avoidance | [31] |
| Oil & Gas Pipeline | Function-Based Methodology | Pressure integrity attack on gas distribution pipeline; DDoS attack on WMN of a pipeline | Evaluate resilience | - | [29], [30] |
Table 9.2. Contributions to Risk Mitigation.
| System | Approach | Motive | Policy | Resilience Stage | Paper Index |
|---|---|---|---|---|---|
| Smart Grid | Systems Approach: replacing generator by Photovoltaic | Reduce risk in the physical domain | No policy | Avoidance | [32] |
| Smart Grid | Systems Approach: Governor | Reduce risk in the cyber domain | Actions should be required and safe for the local system | Containment | [35] |
| Smart Grid | Reinforcement learning | Reduce risk in the cyber domain | Actions to decide which cyber function to patch or scan | Containment | [34] |
| Smart Grid | POMDP | Maintain resilience at minimum operating cost | Actions to take in the physical and cyber domain | Recovery | [36] |
| Oil & Gas Pipeline | Game-theoretic Approach (SSG) | Prevent oil stealing by protecting cyber and physical targets | Patrolling physical targets and choosing cyber targets to protect | Avoidance | [37] |
We presented the following approaches:
1. The smart grid is a mission-critical system, and it is impossible to scan or patch all
systems at the same time. Power engineers cannot place all systems in standby mode
to perform cyber operations. To decide which system to scan or patch in a
particular system state, we presented the reinforcement learning approach. By selecting
the system on which it is essential to act, we reduce the likelihood of an attack and thus
the overall risk. This comes under the avoidance stage of resilience since we are
making the cyber systems secure so that attackers have a hard time compromising the
system.
2. We reduce the impact of manipulating circuit breakers in the grid by replacing
generators with a PV unit or a power supply system which has similar properties. Due
to the fast current and voltage controller response of the PV system, the impact of the
attack is mitigated even if the cyber attack was successful. This comes under the
containment stage of resilience since we are preventing the propagation of cyber attacks
to the physical domain.
3. We reduce the impact of an attack even if a cyber system is compromised by introducing
the concept of the governor. A governor evaluates whether commands are safe and required
and only then allows them to go through. We demonstrate the power of the
governor by designing a governor for the demand response functionality that prevents
malicious load curtailment and shedding commands from being executed in the physical
domain. This comes under the containment stage of resilience since we are preventing
the propagation of cyber attacks to the physical domain.
4. We formulated a theoretical framework to model the interaction between the system
admin of the grid and an adversary in the form of a Partially Observable Stochastic Game.
Here, we fix the strategy of the adversary and convert the problem into a POMDP for the
system admin. The model provides the optimal policy for the admin to take specific
actions in a particular state so that the overall system is resilient with a minimum
operating cost. This comes under the recovery stage of resilience since we compute
optimal actions to bring the grid to its normal state after CPAs.
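In a POMDP such as the one in item 4, the system admin cannot observe the true system state and instead maintains a belief over states that is revised after every action and observation. The following is a minimal sketch of the standard Bayesian belief update. It is not the solver used in this work, and the two states, the "monitor" action, the "alarm" observation, and all probabilities are hypothetical.

```python
def belief_update(belief, transition, observation, action, obs):
    """Standard POMDP belief update.

    b'(s') is proportional to O(obs | s', action) * sum_s T(s' | s, action) * b(s).
    transition[action][s][s2] and observation[action][s2][obs] are probabilities.
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        predicted = sum(transition[action][s][s2] * belief[s] for s in states)
        unnormalized[s2] = observation[action][s2][obs] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical two-state model of the grid as seen by the system admin.
transition = {
    "monitor": {
        "normal": {"normal": 0.9, "attacked": 0.1},
        "attacked": {"normal": 0.0, "attacked": 1.0},
    }
}
observation = {
    "monitor": {
        "normal": {"alarm": 0.1, "quiet": 0.9},
        "attacked": {"alarm": 0.8, "quiet": 0.2},
    }
}

prior = {"normal": 0.5, "attacked": 0.5}
posterior = belief_update(prior, transition, observation, "monitor", "alarm")
```

With these made-up numbers an alarm shifts most of the probability mass to the attacked state. A POMDP policy maps such beliefs, rather than states, to actions.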
Fourth, we formulated the real-world problem of cyber and physical attacks on oil
pipeline infrastructure by proposing a Stackelberg Security Game of three players: the system
administrator as the leader and the cyber and physical attackers as followers. The novelty of
this work is that we have formulated a real-world problem of oil stealing using a game-theoretic
approach. The game has two different types of targets attacked by two distinct types of
adversaries with different motives and who can coordinate to maximize their rewards. The
solution to this game assists the system administrator of the oil pipeline cyber-physical system
in allocating the cyber security controls for the cyber targets and assigning patrol teams to
the pipeline regions efficiently.
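The leader-follower structure of a Stackelberg Security Game can be illustrated with a deliberately simplified sketch: one defender resource, two targets, and a single attacker who observes the defender's coverage and best responds. This is not the three-player game formulated in this work. The target names, payoffs, and the grid search over coverage probabilities are assumptions made for illustration.

```python
def expected(cov, covered_payoff, uncovered_payoff):
    """Expected payoff at a target that is covered with probability cov."""
    return cov * covered_payoff + (1.0 - cov) * uncovered_payoff


def attacker_choice(coverage, targets):
    """Index of the attacker's best target; ties are broken in the defender's favor."""
    def key(i):
        t = targets[i]
        return (
            expected(coverage[i], t["att_cov"], t["att_unc"]),
            expected(coverage[i], t["def_cov"], t["def_unc"]),
        )
    return max(range(len(targets)), key=key)


def best_commitment(targets, steps=100):
    """Grid search over how one resource is split between two targets."""
    best = None
    for k in range(steps + 1):
        c = k / steps
        coverage = [c, 1.0 - c]
        hit = attacker_choice(coverage, targets)
        t = targets[hit]
        value = expected(coverage[hit], t["def_cov"], t["def_unc"])
        if best is None or value > best[0]:
            best = (value, coverage, hit)
    return best


# Hypothetical payoffs for a physical target and a cyber target.
targets = [
    {"name": "pipeline_segment", "att_unc": 10, "att_cov": -5, "def_unc": -10, "def_cov": 5},
    {"name": "scada_server", "att_unc": 4, "att_cov": -2, "def_unc": -4, "def_cov": 2},
]

value, coverage, attacked = best_commitment(targets)
```

The search settles near the coverage split at which the attacker is almost indifferent between the two targets, which is the typical shape of a Stackelberg solution. A realistic formulation would use an exact optimization method rather than a grid.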
The results of the risk evaluation and resource allocation approaches can be used to
build and refine security policies that maintain the resilience of the smart grid to disturbances
caused by malicious and non-malicious threats. They can also be used to influence the design
of the system so that resilience is taken into consideration during the design process.
9.2 Limitations
The limitations of the approaches proposed in this dissertation are as follows:
Resilience Modeling Without Energy Storage. We have not incorporated energy
storage in the grid’s resilience modeling. The storage units of the power plants must be scaled
up and down according to the demand for power in the market. To provide decentralized
generation of energy, the concept of Distributed Generation (DG) is introduced. DG consists of
small-scale generation and storage units that use renewable energy such as solar, wind, and
tidal. DG has the potential to provide efficient power generation and distribution by reducing
transmission and distribution losses. To meet demand during peak hours, gas is also stored
as a fuel that can be used to generate power through peaker power plants. Furthermore, some
portion of the electricity is kept at the transmission level, where it can be used to meet
unexpected demand. Therefore, storage of energy at different levels is essential for the proper
operation of the grid to meet demand at all times.
A cyber attack on a gas pipeline infrastructure providing gas to gas-fired peaker plants,
false data injection attacks on DER and DDoS attack on AMI show that cyber attackers have
the ability to attack energy storage systems and affect the SGR. Hence, it is necessary to
identify and evaluate the risks associated with the energy storage systems and develop
resilience models to perform risk assessment in the presence of various CPAs.
Zero-day Attacks. To compute the likelihood of an attack on the grid’s cyber
functions, we compute the CVSS score based on the known vulnerabilities, which we call the
known risk. We do not account for the unknown risks, i.e., zero-days. The known risk is the
basis of most of the attacks on the cyber system, and by considering it, we have provided a
methodology (reinforcement learning, section 5.1) to avoid it. Therefore, relatively, we are
reducing the risk in terms of known risk. This is the avoidance stage of resilience. If an
attack based on the unknown risk is successful, we have provided two capabilities to reduce the
impact of attacks in the physical (Photo Voltaic in section 5.2) and cyber system (governor in
section 5.3). These approaches prevent attacks (that originate from zero-days) from
propagating to the physical domain. Finally, if an attack happens in the physical domain, we
provided an efficient framework (in Chapter 6) to take actions against an adversary to recover
from those attacks. By presenting approaches at all the levels of resilience, indirectly, we have
accounted for the unknown risk as well.
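To make the notion of known risk concrete, the sketch below propagates exploit probabilities through a small attack graph to estimate the likelihood of compromising a target function. It is an approximation for illustration only: mapping a CVSS score to a probability by dividing by 10, treating parent nodes as independent (a noisy-OR), and the node names and scores are all assumptions, not the exact Bayesian Network inference used in this work.

```python
def compromise_probabilities(exploit_prob, parents, order):
    """Estimate P(node compromised) for each node of an attack graph.

    exploit_prob: node -> probability that its known vulnerability is exploited
    parents:      node -> list of nodes from which it can be reached
    order:        nodes listed so that parents come before children
    """
    p = {}
    for node in order:
        if not parents[node]:
            # Entry node: assumed directly reachable by the attacker.
            reach = 1.0
        else:
            # Noisy-OR: reachable if at least one parent is compromised,
            # treating parents as independent (a simplifying assumption).
            miss = 1.0
            for parent in parents[node]:
                miss *= 1.0 - p[parent]
            reach = 1.0 - miss
        p[node] = reach * exploit_prob[node]
    return p


# Hypothetical CVSS scores per function; score / 10 is an assumed mapping.
cvss = {"corporate_lan": 8.0, "vpn_gateway": 5.0, "scada_server": 5.0, "ecc": 4.0}
exploit_prob = {node: score / 10.0 for node, score in cvss.items()}

parents = {
    "corporate_lan": [],
    "vpn_gateway": [],
    "scada_server": ["corporate_lan"],
    "ecc": ["scada_server", "vpn_gateway"],
}
order = ["corporate_lan", "vpn_gateway", "scada_server", "ecc"]

likelihood = compromise_probabilities(exploit_prob, parents, order)
```

Because only known vulnerabilities feed the exploit probabilities, a zero-day would not change these numbers, which is the limitation this paragraph describes.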
Too Many Sensitive Parameters. It is necessary to identify essential cyber and
physical parameters that can be used to perform sensitivity analysis of the grid. Since such
systems are complex federated systems, the list of parameters would be huge, and it would be
difficult to identify those parameters and analyze their effect on resilience.
Applicable to Real World. In this study, we have used commercial software such as
PowerWorld Simulator to demonstrate attack scenarios on physical power systems. Such
software does not cover all the scenarios of real-world systems. For instance, PowerWorld
Simulator attempts to model reality to the extent that any other simulator does, but it cannot
be used for models of every size or at all time frames. It models the steady state and the
10-30 second response level (transient stability) but no time frame smaller than that. Thus,
one of the limitations of our research is how the results on small-scale systems map to
sizeable real-world systems. Nonetheless, it assists system engineers in securing backup-to-
backup options, so that the primary function is delivered even in the presence of attacks.
System engineers come to understand that not just the primary function but also the backup
functions can be attacked to cause significant impact on the overall system.
Chapter 10
Conclusion and Future Work
In this chapter, we conclude the dissertation by first introducing a summary of the work.
Then, the main contributions of this dissertation are listed. Finally, directions for future work
are presented.
10.1 Summary
A part of the thesis statement asks the question: how does the manipulation of distinct
functional components affect smart grid resilience? We answered this question by providing
two methodologies to perform a risk assessment and evaluate resilience in the presence of
attacks: the End-to-End Risk Assessment and the Attack-Defense approach. In the former, we
compute the risk by including the likelihood and impact of an attack. In the latter, we show how
a sequence of attacks on various stability levels of the grid affects its stability. Moreover, we
provide examples of how both these methodologies can be used in other cyber-physical
systems. Based on the results of the risk assessment methodology, we discuss what we mean
by risk and how it is useful to power engineers.
A part of the thesis statement asks the question: how to allocate resources to maintain
smart grid resilience? We answered this question by showing how to allocate resources in
different scenarios that ultimately reduce the risk. First, we presented a reinforcement-
learning-based approach to compute an optimal policy to decide which cyber system to patch or
scan. The primary motive is to reduce the likelihood of compromising cyber functions, thereby
reducing the overall risk. Second, we demonstrated how a governor is used to prevent cyber
attacks from propagating to the physical domain, thus reducing the impact to zero and thereby
reducing the risk. Third, we presented the PSP framework to decide what actions power engineers
should take in a particular state, such as whether to decrease or increase power reserve,
perform load curtailment or load shedding, or repair a node, in order to minimize the cost of
operation and maintain power system stability. Moreover, we provided a game-theoretic framework
to prevent oil stealing by allocating resources in the cyber and physical domain.
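As an illustration of how a reinforcement learning agent could learn which cyber system to patch or scan, the sketch below shows a single tabular Q-learning update. Q-learning is used here only as a familiar stand-in, and the specific algorithm, state names, reward, and hyperparameters are assumptions rather than details taken from this work.

```python
from collections import defaultdict

ACTIONS = ("patch", "scan")


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step.

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Q-values default to 0.0 for state-action pairs that have not been visited.
q = defaultdict(float)

# Hypothetical experience: patching a vulnerable function yields a reward of 1.0.
updated = q_update(q, "ecc_vulnerable", "patch", reward=1.0, next_state="ecc_patched")


def greedy_action(q, state):
    """Action with the highest learned value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])
```

After enough updates, the greedy action in each state serves as the policy that tells the operator whether patching or scanning is preferable there.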
Finally, we described two historical blackouts (US & Canada and India) and how our
research is useful in understanding those scenarios and how it assists engineers in taking
relevant actions to maintain the resilience of the system.
10.2 Contributions
We presented the end-to-end risk management approach which spans from avoidance
to recovery phase of resilience. The main contributions of this work can be summarized in the
following points:
1. We presented an Attack-Defense approach (Chapter 3) to evaluate the resilience of a
smart grid system in the presence of multiple cyber-physical attacks. The key idea is to
broaden the surface area of attack by including distinct functional components of the
grid, such as advanced metering infrastructure, gas pipeline system, and distributed
energy resources.
2. We demonstrated the end-to-end risk assessment methodology (Chapter 4) to quantify
risk for a cyber-physical attack on the grid. It combines both the likelihood and impact
of an attack. We modeled the cyber domain as a Bayesian Network to compute the
likelihood of compromising ECC. Then, we performed contingency analysis to
quantify the impact of manipulating circuit breakers. We combined the results of the
likelihood and impact to compute risk. Finally, we discussed what we mean by risk
and how it is useful to power engineers. This is the preliminary step in the avoidance
stage of resilience.
3. We presented two risk mitigation approaches (Chapter 5) to demonstrate how to reduce
risk. We presented a reinforcement-learning-based methodology to decide which system to
scan or patch in a particular system state. The primary motive is to prevent the grid’s cyber
domain from getting compromised, i.e., avoiding risk. Moreover, we demonstrated how
replacing a generator with a Photo Voltaic unit makes the system resilient to a circuit
breaker manipulation attack, and how a governor system can be used to prevent malicious
demand response commands from going through from a compromised cyber system. Both
approaches fall under the containment phase because the cyber system is already
compromised and leveraged by an adversary to mount attacks on the physical system.
These approaches prevent attacks from propagating to the physical system and thus
reduce risk by containing the functionality of the compromised system.
4. We presented a theoretical framework to formulate the Power Storage Protection
framework (Chapter 6) against a fixed opponent (adversary). We fix the strategy for the
adversary, model the problem as a POMDP from the perspective of the defender
(power utility), and solve it using a POMDP solver. We provide experimental results to
support our claim using a simplified PSP scenario in which the optimal POMDP policy is
computed efficiently. By implementing this approach, our main motive is to recover
from attacks and bring the system’s functionality to a desired performance level.
Thus, this work falls in the recovery stage.
5. We presented a theoretical framework for formulating and solving the oil stealing problem
(Chapter 7) where cyber and physical attackers may coordinate and attack oil pipeline
system components to maximize their profit. We propose an SSG of three players: the
system administrator of the oil pipeline as the leader (defender), and the cyber attacker and
the physical attacker as followers. This game has two different types of targets being
attacked by two different types of adversaries with distinct motives and who can
coordinate to maximize their rewards. This work falls in the avoidance phase of
resilience since the primary motive is to avoid oil stealing attacks at various locations
by allocating resources in the cyber and physical domain.
Overall, in this dissertation, we presented how to avoid risk by performing a complete risk
assessment methodology to quantify the risk of a cyber-physical attack, and discussed the
different ways to reduce risk in the cyber and physical domain. Finally, we discussed how to
avoid and recover from attacks using efficient and effective ways for resource allocation
against real-world adversaries.
10.3 Future Work
Our primary focus in this dissertation is to demonstrate the end-to-end risk analysis
methodology to compute the risk of a cyber-physical attack and discuss how to reduce the risk
in different ways. While the use cases presented in this dissertation are smart grid-related, we
believe that the same methodology can also be used to compute the risk of a cyber-physical
attack in other cyber-physical systems such as oil & gas, water treatment plants, and nuclear plants.
Our thesis contribution enables the following future work.
In Chapter 4, we described the BAGS tool to quantify the likelihood of an attack on the
smart grid cyber domain. Due to the dynamics of the cyber and physical environment and the
adoption of technology in the ICS, the risk associated with the system components is also
changing. Thus, one can develop the BAGS tool using a real-world dataset and implement it in
a SCADA system. Future work is to implement this tool and add features such as alert
mechanisms and posterior probabilities.
In Chapter 5, we demonstrated how to reduce the impact of manipulating circuit
breakers by replacing generators with Photovoltaic units. The question that arises in this
scenario is what percentage of PV units should be present in the system. What are the factors
that decide how many PV units should be present in the grid? Can we generalize those factors for
all smart grid systems? It is essential to answer such questions so that power engineers can
make efficient use of renewable resources without making the system unstable.
In Chapter 6, we described the PSP problem to maximize the resilience of the system
and minimize the cost to the system admin. For future work, one can plan to develop techniques
to scale the PSP problem both with and without real-world data. Moreover, one can consider
alternative fixed strategies for different categories of attacker and simulate the defender’s
response against them. In Chapter 7, we discussed the theoretical approach to the oil stealing
problem. For future work, we will perform human subject experiments using Amazon Mechanical
Turk (AMT). Since the attackers are stealing oil from different locations continuously, one can
model the problem as a repeated security game (RSG) in the OPCPS context.
10.4 Concluding Remarks
In this dissertation, we presented how to perform end-to-end risk management of a
cyber-physical system. Our research improves the state of the art by linking risk assessment
with risk mitigation. We demonstrated how the results of the risk assessment are useful to
system engineers in deciding what actions to take to mitigate (reduce) risk, which has been
largely unaddressed by the current literature. Furthermore, we show the application of our
approaches in other cyber-physical systems such as oil & gas pipelines and autonomous cars.
The results of this study can be used to derive secure policies, develop and design more robust
and resilient systems, and also improve the response of the system to ongoing attacks. From
this point, we should start focusing on developing models to mitigate different types of risk
and verify the results on large-scale real-world systems.
References
1. US Department of Homeland Security, Energy Sector http://www.dhs.gov/energy-sector.
Accessed in Dec 2015.
2. Baheti, R., & Gill, H. (2011). Cyber-physical systems. The impact of control technology,
12, 161-166.
3. Neuman, C., & Tan, K. (2011, October). Mediating cyber and physical threat propagation
in secure smart grid architectures. In Smart Grid Communications (SmartGridComm),
2011 IEEE International Conference on (pp. 238-243). IEEE.
4. Chen, T. M., & Abu-Nimeh, S. (2011). "Lessons from stuxnet." Computer, 44(4), 91-93.
5. Analysis of the Cyber Attack on the Ukrainian Power Grid, March 2016
http://www.nerc.com/pa/CI/ESISAC/Documents/EISAC_SANS_Ukraine_DUC_18Mar2
016.pdf. Accessed in Dec 2017.
6. Tripwire Study: Cyber Attackers Successfully Targeting Oil and Gas Industry, (2016)
http://www.tripwire.com/company/news/press-release/tripwire-study-cyber-attackers-
successfully-targeting-oil-and-gas-industry/. Accessed in Dec 2015.
7. The Map That Shows Why a Pipeline Explosion in Turkey Matters to the U.S.
http://www.bloomberg.com/news/2014-12-10/the-map-that-shows-why-a-pipeline-
explosion-inturkey- matters-to-the-u-s-.html. Accessed in Dec 2015.
8. Terror Attack On Algerian Gas Plant Raising Security Fears for North Africa’s Oil and Gas
Infrastructure, March 2016. http://www.ibtimes.com/terror-attack-algerian-gas-
plantraising-security-fears-north-africas- oil-gas-2341217. Accessed in March 2016.
9. Mexican cartels steal billions from oil industry, September 2014.
http://fuelfix.com/blog/2014/09/25/mexican-cartelssteal-billions-from-oil-industry/.
Accessed in March 2016.
10. Shell loses 110,000 bdp to oil theft, vandalism
http://234press.com/index.php/2015/09/19/shell-loses-110000bdp-to-oil-theft-vandalism/.
Accessed in March 2016.
11. Norway's Oil Companies Largest Coordinated Attack.
https://www.duosecurity.com/blog/norway-s-oil-companiestargets-of-largest-
coordinated-attack. Accessed in March 2016.
12. Lloyd’s Emergency risk report of 2015.
http://darkmatters.norsecorp.com/2015/07/08/lloyds-lossesfrom-attack-on-power-grid-
could-top-one-trillion-dollars/. Accessed in April 2016.
13. Here's how the Porter Ranch gas leak could lead to power outages. (2015)
http://www.scpr.org/news/2016/03/24/58622/here-s-how-theporter-ranch-gas-leak-could-
lead-to/. Accessed in March 2015.
14. Sridhar, S., Hahn, A., & Govindarasu, M. (2012). Cyber–physical system security for the
electric power grid. Proceedings of the IEEE, 100(1), 210-224.
15. Tan, Rui, Hoang Hai Nguyen, Eddy YS Foo, Xinshu Dong, David KY Yau, Zbigniew
Kalbarczyk, Ravishankar K. Iyer, and Hoay Beng Gooi, "Optimal False Data Injection
Attack against Automatic Generation Control in Power Grids." In 2016 ACM/IEEE 7th
International Conference on Cyber-Physical Systems (ICCPS), pp. 1-10. IEEE, 2016.
16. Solar Energy Grid Integration System (SEGIS) 2007.
https://www1.eere.energy.gov/solar/pdfs/segis_concept_paper.pdf. Accessed in Aug 2016.
17. AlMajali, A., & Dweik, W. (2016, December). Modeling worm propagation in the
advanced metering infrastructure. In Electronic Devices, Systems and Applications
(ICEDSA), 2016 5th International Conference on (pp. 1-4). IEEE.
18. Yan, Y., Qian, Y., Sharif, H., & Tipper, D.: A survey on cyber security for smart grid
communications. Communications Surveys & Tutorials, IEEE,14(4), 998-1010. (2012).
19. Cárdenas, A. A., Amin, S., Schwartz, G. A., Dong, R., & Sastry, S. 2012. A game theory
model for electricity theft detection and privacy-aware control in AMI systems. In
Communication, Control, and Computing (Allerton), 50th Annual Allerton Conference
(pp. 1830-1837). IEEE.
20. Albadi, M. H., & El-Saadany, E. F. (2008). A summary of demand response in electricity
markets. Electric power systems research, 78(11), 1989-1996.
21. AlMajali, A., Rice, E., Viswanathan, A., Tan, K., and Neuman, C.: A systems approach to
analyzing cyber-physical threats in the Smart Grid. In Smart Grid Communications
(SmartGridComm), IEEE International Conference on pp. 456-461. (2013)
22. Amini, S., Mohsenian-Rad, H., & Pasqualetti, F. (2015, February). Dynamic load altering
attacks in smart grid. In Innovative Smart Grid Technologies Conference (ISGT), 2015
IEEE Power & Energy Society (pp. 1-5). IEEE.
23. Wood, Paul, Saurabh Bagchi, and Alefiya Hussain, "Defending against strategic
adversaries in dynamic pricing markets for smart grids." In 2016 8th International
Conference on Communication Systems and Networks (COMSNETS), pp. 1-8. IEEE,
2016.
24. Al Majali, A. (2014). A Function-based Methodology for Evaluating Resilience in Smart
Grids (Doctoral dissertation, University of Southern California).
25. R. Blank, P. Gallagher, NIST special publication 800-30 revision 1 guide for conducting
risk assessments, Tech. Rep., National Institute of Standards and Technology
(2012).
26. W. Stallings, L. Brown, Computer security, Principles and practice (2nd ed). Edinburgh
Gate: Pearson education limited.
27. Wadhawan, Y., & Neuman, C. (2017, April). Analyzing Cyber-Physical Attacks on Smart
Grid Systems. In Proceedings of the 2017 Workshop on Modeling and Simulation of
Cyber-Physical Energy Systems. CPS Week 2017.
28. Wadhawan, Yatin, Clifford Neuman, and Anas Al Majali. "A Systematic Approach for
Analyzing Multiple Cyber-Physical Attacks on the Smart Grid." In Proceedings of the
International Science Index, Computer and Information Engineering International
Conference on Cyber Security of Cyber Physical Systems, Boston, MA, USA, vol. 12.
2018.
29. Wadhawan, Y., & Neuman, C. (2015). Evaluating Resilience of Oil and Gas Cyber
Physical Systems: A Roadmap. In Annual Computer Security Application Conference
(ACSAC) Industrial Control System Security (ICSS) Workshop.
30. Wadhawan, Y., & Neuman, C. (2016, October). Evaluating Resilience of Gas Pipeline
Systems Under Cyber-Physical Attacks: A Function-Based Methodology. In Proceedings
of the 2nd ACM Workshop on Cyber-Physical Systems Security and Privacy (pp. 71-80).
ACM.
31. Wadhawan, Y., & Neuman, C. BAGS: A Tool for Quantifying Resilience of Smart Grid
Systems. Submitted to International Workshop on Cyber-Physical Systems (IWCPS May
2017).
32. Al Majali, A., Wadhawan, Y., Neuman, C., Saadeh, M., & Shalalfeh, L. "Risk
Assessment of Smart Grids under Cyber-physical Attacks using Bayesian Networks."
Accepted by International Journal of Electronic Security and Digital Forensics 2019.
33. Wadhawan, Yatin, Anas AlMajali, and Clifford Neuman. "A Comprehensive Analysis of
Smart Grid Systems against Cyber-Physical Attacks." Electronics 7, no. 10 (2018): 249.
34. Wadhawan, Yatin, and Clifford Neuman. "RL-BAGS: A Tool for Smart Grid Risk
Assessment." In 2018 International Conference on Smart Grid and Clean Energy
Technologies (ICSGCE), pp. 7-14. IEEE, 2018.
35. Wadhawan, Yatin, Anas AlMajali, and Clifford Neuman. IGNORE: A Policy Server to
prevent Cyber Attacks from Propagating to the Physical Domain.
36. Wadhawan, Yatin, Clifford Neuman, and Anas AlMajali. "PSP: A Framework to Allocate
Resources to Power Storage Systems under Cyber-Physical Attacks." Proceedings of
ICS & SCADA (2018): 57.
37. Wadhawan, Y., & Neuman, C. (2016, August). Defending Cyber-Physical Attacks on Oil
Pipeline Systems: A Game-Theoretic Approach. In Proceedings of the 1st International
Workshop on AI for Privacy and Security (p. 7). ACM.
38. Stamp, J., McIntyre, A., & Ricardson, B. (2009, March). Reliability impacts from cyber
attack on electric power systems. In Power Systems Conference and Exposition, 2009.
PSCE'09. IEEE/PES (pp. 1-8). IEEE.
39. Cárdenas, A. A., Amin, S., Lin, Z. S., Huang, Y. L., Huang, C. Y., & Sastry, S. (2011,
March). Attacks against process control systems: risk assessment, detection, and response.
In Proceedings of the 6th ACM symposium on information, computer and communications
security (pp. 355-366). ACM.
40. Liu, S., Feng, X., Kundur, D., Zourntos, T., & Butler-Purry, K. L. (2011, October).
Switched system models for coordinated cyber-physical attack construction and
simulation. In Smart Grid Modeling and Simulation (SGMS), 2011 IEEE First International
Workshop on (pp. 49-54). IEEE.
41. Sridhar, S., Hahn, A., & Govindarasu, M. (2012, January). Cyber attack-resilient control
for smart grid. In Innovative Smart Grid Technologies (ISGT), 2012 IEEE PES (pp. 1-3).
IEEE
42. Srivastava, A., Morris, T., Ernster, T., Vellaithurai, C., Pan, S., & Adhikari, U. (2013).
Modeling cyber-physical vulnerability of the smart grid with incomplete information. IEEE
Transactions on Smart Grid, 4(1), 235-244.
43. AlMajali, Anas, Arun Viswanathan, and Clifford Neuman. "Resilience evaluation of
demand response as spinning reserve under cyber-physical threats." Electronics 6, no. 1
(2016): 2.
44. Pan, K., Teixeira, A. M., Cvetkovic, M., & Palensky, P. (2016, November). Combined data
integrity and availability attacks on state estimation in cyber-physical power grids. In Smart
Grid Communications (SmartGridComm), 2016 IEEE International Conference on (pp.
271-277). IEEE
45. Lu, D., Liu, Y., & Zeng, Y. (2016, November). Risk assessment of power grid considering
the reliability of the information system. In Smart Grid Communications
(SmartGridComm), 2016 IEEE International Conference on (pp. 723-728). IEEE.
46. Panteli, M., Mancarella, P., Trakas, D., Kyriakides, E., & Hatziargyriou, N. (2017). Metrics
and Quantification of Operational and Infrastructure Resilience in Power Systems. IEEE
Transactions on Power Systems.
47. Alohali, B., Kifayat, K., Shi, Q., & Hurst, W. (2017). Replay Attack Impact on Advanced
Metering Infrastructure (AMI). In Smart Grid Inspired Future Technologies: First
International Conference, SmartGIFT 2016, Liverpool, UK, May 19-20, 2016, Revised
Selected Papers (pp. 52-59). Springer International Publishing.
48. Dabrowski, A., Ullrich, J. and Weippl, E.R., 2017. Grid shock: coordinated load-changing
attacks on power grids. In Proc. ACM ACSAC’17 (Dec. 2017).
49. Soltan, S., Mittal, P. and Poor, H.V., 2018, August. BlackIoT: IoT Botnet of high wattage
devices can disrupt the power grid. In Proc. USENIX Security (Vol. 18).
50. Rinaldi, S. M. Modeling and Simulating Critical Infrastructures and their Interdependencies.
In System Sciences, 2004. Proceedings of the 37th Annual Hawaii International Conference on.
51. Shahidehpour, Mohammad, Yong Fu, and Thomas Wiedman, "Impact of
natural gas infrastructure on electric power systems." Proceedings of the IEEE 93, no. 5
(2005): 1042-1056.
52. Manshadi, Saeed D., and Mohammad E. Khodayar. "Resilient operation of multiple energy
carrier microgrids." IEEE Transactions on Smart Grid 6, no. 5 (2015): 2283-2292.
53. Li, Tao, Mircea Eremia, and Mohammad Shahidehpour, "Interdependency of natural gas
network and power system security." IEEE Transactions on Power Systems 23, no. 4
(2008): 1817-1824.
54. Yuan, W., Zhao, L., & Zeng, B. (2014). Optimal power grid protection through a defender–
attacker–defender model. Reliability Engineering & System Safety, 121, 83-89.
55. Erdener, Burcin Cakir, Kwabena A. Pambour, Ricardo Bolado Lavin, and Berna Dengiz,
"An integrated simulation model for analysing electricity and gas systems." International
Journal of Electrical Power & Energy Systems 61 (2014): 410-420.
56. Laprie, J. C., Kanoun, K., and Kaâniche, M.: Modelling interdependencies between the
electricity and information infrastructures. In Computer Safety, Reliability, and Security,
pp. 54-67. Springer Berlin Heidelberg. (2007)
57. Wu, Baichao, Aiping Tang, and Jie Wu, "Modeling cascading failures in interdependent
infrastructures under terrorist attacks." Reliability Engineering & System Safety 147
(2016): 1-8.
58. Y. Xiang, L. Wang, Y. Zhang, Power system adequacy assessment with probabilistic cyber
attacks against breakers, in: 2014 IEEE PES General Meeting — Conference Exposition,
2014, pp. 1–5. doi:10.1109/PESGM.2014.6939374.
59. M. S. Rahman, H. R. Pota, M. J. Hossain, Cyber vulnerabilities on agent-based smart grid
protection system, in: PES General Meeting— Conference & Exposition, 2014 IEEE,
IEEE, 2014, pp. 1–5
60. B. Kang, P. Maynard, K. McLaughlin, S. Sezer, F. Andrén, C. Seitl, F. Kupzog, T. Strasser,
Investigating cyber-physical attacks against IEC 61850 photovoltaic inverter installations,
in: Emerging Technologies & Factory Automation (ETFA), 2015 IEEE 20th Conference
on, IEEE, 2015, pp. 1–8.
61. X. Liu, M. Shahidehpour, Y. Cao, L. Wu, W. Wei, X. Liu, Microgrid risk analysis
considering the impact of cyber attacks on solar PV and ESS control systems, IEEE
Transactions on Smart Grid 8 (3) (2017) 1330–1339.
62. Li, S., Tryfonas, T., Russell, G., & Andriotis, P. (2016). Risk assessment for mobile
systems through a multilayered hierarchical Bayesian network. IEEE transactions on
cybernetics, 46(8), 1749-1759.
63. Hosseini, S., & Barker, K. (2016). Modeling infrastructure resilience using Bayesian
networks: a case study of inland waterway ports. Computers & Industrial Engineering, 93,
252-266.
64. Poolsappasit, N., Dewri, R., & Ray, I. (2012). Dynamic security risk management using
bayesian attack graphs. IEEE Transactions on Dependable and Secure Computing, 9(1),
61-74.
65. Li, H., Lai, L., & Qiu, R. C. 2011. A denial-of-service jamming game for remote state
monitoring in smart grid. In Information Sciences and Systems (CISS), 2011 45th Annual
Conference on (pp. 1-6). IEEE.
66. Shelar, D., & Amin, S. (2015, July). Analyzing vulnerability of electricity distribution
networks to DER disruptions. In American Control Conference (ACC), 2015 (pp. 2461-
2468). IEEE.
67. Srikantha, Pirathayini, and Deepa Kundur, "A DER Attack-Mitigation Differential Game
for Smart Grid Security Analysis." IEEE Transactions on Smart Grid 7, no. 3 (2016): 1476-
1485.
68. Yan, J., He, H., Zhong, X., & Tang, Y. (2017). Q-Learning-Based Vulnerability Analysis
of Smart Grid Against Sequential Topology Attacks. IEEE Transactions on Information
Forensics and Security, 12(1), 200-210.
69. Nguyen, T.H., Yang, R., Azaria, A., Kraus, S., and Tambe, M. 2013. Analyzing the
effectiveness of adversary modeling in security games. In AAAI.
70. Yang, R., Kiekintveld, C., Ordonez, F., Tambe, M., and John, R. 2011. Improving resource
allocation strategy against human adversaries in security games. In IJCAI Proceedings
International Joint Conference on Artificial Intelligence, volume 22, page 458.
71. Pita, J., Jain, M., Marecki, J., Ordóñez, F., Portway, C., Tambe, M., & Kraus, S. 2008.
Deployed ARMOR protection: the application of a game theoretic model for security at the
Los Angeles International Airport. In Proceedings of the 7th International joint conference
on Autonomous agents and multiagent systems: industrial track (pp. 125-132).
72. Tsai, J., Kiekintveld, C., Ordonez, F., Tambe, M., & Rathi, S. (2009). IRIS-a tool for
strategic security allocation in transportation networks.
73. Kar, D., et al. 2015. "A game of thrones: when human behavior models compete in repeated
Stackelberg security games." Proceedings of the International Conference on Autonomous
Agents and Multiagent Systems.
74. Gholami, S., Wilder, B., Brown, M., Sinha, A., Sintov, N., & Tambe, M. (2016). A Game
Theoretic Approach on Addressing Cooperation among Human Adversaries. Proceedings
of the 15th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2016)
75. Ayar, M., Obuz, S., Trevizan, R.D., Bretas, A.S. and Latchman, H.A., 2017. A distributed
control approach for enhancing smart grid transient stability and resilience. IEEE
Transactions on Smart Grid, 8(6), pp.3035-3044.
76. Ayar, M., Trevizan, R.D., Obuz, S., Bretas, A.S., Latchman, H.A. and Bretas, N.G., 2017.
Cyber-physical robust control framework for enhancing transient stability of smart grids.
IET Cyber-Physical Systems: Theory & Applications, 2(4), pp.198-206.
77. Babalola, A.A., Belkacemi, R. and Zarrabian, S., 2018. Real-time cascading failures
prevention for multiple contingencies in smart grids through a multi-agent system. IEEE
Transactions on Smart Grid, 9(1), pp.373-385.
78. Beigi Mohammadi, N., Mišić, J., Mišić, V.B. and Khazaei, H., 2014. A framework for
intrusion detection system in advanced metering infrastructure. Security and
Communication Networks, 7(1), pp.195-205.
79. Ryutov, T., AlMajali, A., & Neuman, C. (2015, April). Modeling security policies for
mitigating the risk of load altering attacks on smart grid systems. In Modeling and
Simulation of Cyber-Physical Energy Systems (MSCPES), 2015 Workshop on (pp. 1-6).
IEEE.
80. Common Vulnerability Scoring System v3.0: Specification Document, url:
https://www.first.org/cvss/specification-document. Accessed on August 2018.
81. California Independent System Operator (CAISO).
http://www.caiso.com/Pages/default.aspx. Accessed on 12 March 2019.
82. C. Perkins, E. Belding-Royer, and S. Das, “Ad hoc on-demand distance vector (AODV)
routing,” Tech. Rep., 2003.
83. Pipeline Pressure Limits. http://www.hse.gov.uk/pipelines/resources/pipelinepressure.htm.
Accessed in March 2016.
84. Pipeline Control Valves, Southern California Gas Company.
https://www.socalgas.com/documents/news-room/factsheets/PipelineValves.pdf.
Accessed in March 2016.
85. Jawhar, I., Mohamed, N. and Shuaib, K. A framework for pipeline infrastructure
monitoring using wireless sensor networks. In Wireless Telecommunications Symposium,
(2007). WTS 2007 (pp. 1-7). IEEE.
86. Communications Module for Electricity Meters. (2013)
http://www.silverspringnet.com/pdfs/SilverSpring-DatasheetCommunications-
Modules.pdf. Accessed in March 2016
87. Radio Propagation Model used in Ns-2
http://kom.aau.dk/group/05gr1120/ref/Channel.pdf. Accessed in March 2016.
88. These Chinese hackers tricked Tesla’s Autopilot into suddenly switching lanes.
https://www.cnbc.com/2019/04/03/chinese-hackers-tricked-teslas-autopilot-into-
switching-lanes.html Accessed on 29 September 2019.
89. Lan, Q., Zou, Y., & Feng, C.: Cascading Failure of Power Grids Under Three Attack
Strategies. Chinese Journal of Computational Physics, 29(6), 943-948 (2012)
90. Huang, Bing, Alvaro A. Cardenas, and Ross Baldick. "Not Everything is Dark and Gloomy:
Power Grid Protections Against IoT Demand Attacks." In 28th USENIX Security
Symposium (USENIX Security 19), pp. 1115-1132. 2019.
91. Generic Solar Photovoltaic System Dynamic Simulation Model Specification.
https://www.powerworld.com/files/WECC-Solar-PV-Dynamic-Model-Specification-
September-2012.pdf. Accessed in March 2016.
92. OWASP Security by Design Principles, url:
https://www.owasp.org/index.php/Security_by_Design_Principles. Accessed in February 2019.
93. Introduction to NISTIR 7628 Guidelines for Smart Grid Cyber Security.
https://www.nist.gov/sites/default/files/documents/smartgrid/nistir7628_total.pdf.
Accessed in Jan 2019.
94. L. Munoz Gonzalez, E. Lupu, Bayesian attack graphs for security risk assessment. IST-153
Workshop on Cyber Resilience 2016.
95. Yodo, N., & Wang, P. (2016). Resilience modeling and quantification for engineered
systems using Bayesian networks. Journal of Mechanical Design, 138(3), 031404.
96. AISpace: Tools for learning Artificial Intelligence, http://aispace.org/bayes/index.shtml.
Accessed on February 2019.
97. T. Athay, R. Podmore, S. Virmani, A practical method for the direct analysis of transient
stability, IEEE Transactions on Power Apparatus and Systems (2) (1979) 573–584.
98. M. Pai, Energy function analysis for power system stability, Springer Science & Business
Media, 2012.
99. W. Rahmouni, L. Benasla, Transient stability analysis of the IEEE 39-bus power system
using gear and block methods, in: Electrical Engineering- Boumerdes (ICEE-B), 2017 5th
International Conference on, IEEE, 2017, pp. 1–6.
100. M. Cupelli, C. D. Cardet, A. Monti, Voltage stability indices comparison on the
IEEE-39 bus system using RTDS, in: Power System Technology (POWERCON), 2012 IEEE
International Conference on, IEEE, 2012, pp. 1–6.
101. Rajapakse, F. Gomez, O. Nanayakkara, P. Crossley, V. Terzija, Rotor angle stability
prediction using post-disturbance voltage trajectory patterns, in: Power & Energy Society
General Meeting, 2009. PES’09. IEEE, IEEE, 2009, pp. 1–6.
102. Renewable Transient Stability Modeling for Wind and Solar plants,
url:https://www.powerworld.com/knowledge-base/renewable-transient-stability-
modeling-for-wind-and-solar-plants. Accessed in August 2018.
103. P. Kundur, J. Paserba, S. Vitet, Overview on definition and classification of power
system stability, in: CIGRE/IEEE PES International Symposium Quality and Security of
Electric Power Delivery Systems, 2003. CIGRE/PES 2003., IEEE, 2003, pp. 1–4.
104. EPRI, EPRI Power System Dynamics Tutorial, Tech. Rep. 1016042, Palo Alto, CA
(2009).
105. Anthony R. Cassandra 2003-2018. Partially Observable Markov Decision Process.
http://www.pomdp.org. Accessed in February 2016.
106. Cassandra, Anthony, Michael L. Littman, and Nevin L. Zhang. "Incremental pruning:
A simple, fast, exact method for partially observable Markov decision processes."
In Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, pp.
54-61. Morgan Kaufmann Publishers Inc., 1997.
107. Oliehoek, Frans, Matthijs TJ Spaan, and Nikos Vlassis. "Best-response play in partially
observable card games." In Proceedings of the 14th annual machine learning conference of
Belgium and the Netherlands, pp. 45-50. 2005.
108. Alpha Guardian. 2017. Energy Storage System
Vulnerabilities. http://www.alphaguardian.net/energy-storage-system-cybervulnerabilities/
Accessed in March 2016.
109. Chen, J., & Zhu, Q. (2017). A game-theoretic framework for resilient and distributed
generation control of renewable energies in microgrids. IEEE Transactions on Smart Grid,
8(1), 285-295.
110. Why do we need pipelines? http://www.pipeline101.com/why-do-we-need-pipelines
Accessed in October 2015.
111. Hankin, C. 2016. Game Theory and Industrial Control Systems. In Semantics, Logics,
and Calculi (pp. 178-190). Springer International Publishing.
112. Overview of Cyber Vulnerabilities, ICS-CERT. https://ics-cert.us-
cert.gov/content/overview-cyber-vulnerabilities. Accessed in August 2016.
113. R. Bellman, "A Markovian decision process," Journal of Mathematics and Mechanics,
pp. 679-684, 1957.
114. C. Watkins and P. Dayan. "Q-learning," Machine learning, vol. 8, no. 3-4, pp. 279-292,
1992.
115. Burlap Reinforcement Learning. [Online]. Available: http://burlap.cs.brown.edu/.
Accessed in March 2016.
116. Ghosh, R. Patel, M. Datta, L. Meegahapola, Investigation of transient stability of a
power network with solar-pv generation: Impact of loading level & control strategy, in:
Innovative Smart Grid Technologies- Asia (ISGT-Asia), 2017 IEEE, IEEE, 2017, pp. 1–6.
117. E. Munkhchuluun, L. Meegahapola, A. Vahidnia, Impact on rotor angle stability with
high solar-pv generation in power networks, in: Innovative Smart Grid Technologies
Conference Europe (ISGT-Europe), 2017 IEEE PES, IEEE, 2017, pp. 1–6.
118. Othmane, L.B., Weffers, H., Mohamad, M.M. and Wolf, M., 2015. A survey of security
and privacy in connected vehicles. In Wireless sensor and mobile ad-hoc networks (pp.
217-247). Springer, New York, NY.
119. T. Hoppe, S. Kiltz, and J. Dittmann, “Security threats to automotive CAN networks-
practical examples and selected short-term counter-measures,” in Proc. Comput. Safety,
Rel., Security, 2008, pp. 235–248.
120. K. Koscher et al., “Experimental security analysis of a modern automobile,” in IEEE
Symp. Security Privacy, May 2010, pp. 447–462.
121. S. Checkoway et al., “Comprehensive experimental analyses of automotive attack
surfaces,” in Proc. 20th USENIX SEC, 2011, pp. 1–16.
122. Autonomous Vehicle Brakes. What is
it and how does it work? https://www.astrobrake.co.za/autonomous-vehicle-brakes-work/.
Accessed in January 2018.
123. U.S.-CANADA Power System Outage Task force. Report on the August 14, 2003
blackout in the United States and Canada: Causes and recommendations.
https://energy.gov/sites/prod/files/oeprod/DocumentsandMedia/BlackoutFinal-Web.pdf.
Accessed in January 2018.
124. India: Report on the Grid Disturbance on 30th July 2012 and 31st July 2012.
http://www.cercind.gov.in/2012/orders/Final_Report_Grid_Disturbance.pdf. Accessed in
January 2019.
125. Introduction to NISTIR 7628 Guidelines for Smart Grid Cyber Security.
https://www.nist.gov/sites/default/files/documents/smartgrid/nistir-7628_total.pdf.
Accessed in January 2019.
126. IEEE 9-bus Model. http://icseg.iti.illinois.edu/wscc-9-bus-system/. Accessed in August
2018
127. U.S. Energy Information Administration, “Electric Sales, Revenue, and Average
Price,” [Online]. Available: http://www.eia.gov/electricity/sales_revenue_price/.
Accessed in August 2018.
128. RTDS. https://www.rtds.com/applications/cyber-security/. Accessed in Aug 2019.
129. Focal Report 7: CIP Resilience and Risk Management in Critical Infrastructure
Protection Policy: Exploring the Relationship and Comparing its Use. Accessed on 1
September 2019. https://www.files.ethz.ch/isn/164305/Focal-Report-7-SKI.pdf
130. Bishop, Matthew A. "The art and science of computer security." (2002).
Asset Metadata
Creator: Wadhawan, Yatin (author)
Core Title: Defending industrial control systems: an end-to-end approach for managing cyber-physical risk
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 12/05/2019
Defense Date: 09/23/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: cyber attack,cyber security,cyber-physical system,game-theory,industrial control system,OAI-PMH Harvest,power grid,security systems,smart grid
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Neuman, Clifford (committee chair), Halfond, William G.J. (committee member), Prasanna, Viktor (committee member)
Creator Email: wadhawanyatin@gmail.com,ywadhawa@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-245009
Unique identifier: UC11674744
Identifier: etd-WadhawanYa-8003.pdf (filename), usctheses-c89-245009 (legacy record id)
Legacy Identifier: etd-WadhawanYa-8003.pdf
Dmrecord: 245009
Document Type: Dissertation
Rights: Wadhawan, Yatin
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA