Memorable, secure, and usable authentication secrets
MEMORABLE, SECURE, AND USABLE AUTHENTICATION SECRETS
by
Simon S. Woo
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2017
Copyright 2017 Simon S. Woo
Dedication
To my parents, grandmother Monica, and Olivia
Acknowledgments
During my Ph.D. studies, many people supported, encouraged, and enlightened me, and
they deserve my sincere gratitude and appreciation.
First, I would like to express my deepest thankfulness and appreciation to my advisor,
Professor Jelena Mirkovic, for her support, instruction, and encouragement throughout
my entire time as a Ph.D. student. I am thankful that she accepted me as her student,
and gave me an opportunity to work under her guidance, all the while providing insightful
suggestions and sharing her experience and knowledge in many different areas. Her
knowledge and passion for scientific research truly inspired me. I am honored to have
had the opportunity to be her student; under her guidance I learned to sharpen my critical
thinking and problem-solving abilities, as well as my teamwork skills. I know that the
abilities I acquired will be a lifelong asset. Professor Mirkovic has been a tremendous
role model for me; like her, I will strive to be a good researcher, professor, adviser,
and mentor. I appreciate all the time and ideas she contributed towards my research, as
well as the support, advice, and motivation that she provided during the tough times of
my Ph.D. journey.
I want to thank Professors Ron Artstein and Elsi Kaiser who, for more than three
years, provided valuable feedback and advice during weekly meetings on my different
research projects. I truly appreciate the time and the insightful guidance they provided
from their different areas of expertise. I would also like to thank Professor Kevin Knight
for providing instruction in the area of data wrangling, in addition to the many other
exciting topics in his class – such as acquiring critical skills in natural language
processing. From his class and NLP seminar, I was able to expand my limited knowledge and
widen my interests.
I also want to thank Professor Aleksandra Korolova for taking the time out of her
busy schedule to serve on my qualifying exam committee, and Professors Cliff Neuman
and David Morgan who gave me the opportunity to TA for the security system course
(which inspired me to further pursue my Ph.D. in computer security). Additionally, I
would like to extend my sincerest thanks and appreciation to Professor Bart Kosko in
Viterbi’s EE Department, who provided clear answers to some of my questions on
information theory, probability, statistics, and neural networks, among many others, and to
Professors Brent Kang at KAIST and Jason Hong at CMU for giving invaluable feedback
on some of my research work.
I thank my fellow friends and colleagues at the STEEL group, USC and ISI for their
friendship and help: Abdulla Alwabel, Hao Shi, Xiyue Deng, Sangwon Lee, Haeran
Jeon, Wenzhe Li, Liang Zhu, Hang Guo, Beomjun Kim, Jingul Kim, Duoduo Yu, Aliya
Deri, and many others. They have been most supportive, and we established friendships
that contributed immensely to my professional and personal time during the entire Ph.D.
experience. In addition, I thank the M.S. students who worked with me on various
projects; they include Ameya Hanamsagar, Nan Yang, and Le Xiao. Ameya’s and Le’s
contributions were especially valuable, and it has been my great pleasure to work with
these students. I am also grateful to Joe Kemp, Alba Regalado and Jeanine Yamazaki
for their assistance on administrative tasks at ISI, and to Janice Wheeler for offering her
time to read my thesis and provide helpful suggestions.
Finally, I thank my family for their love and encouragement, and for standing by me
when I was going through difficult times. I give special thanks to my parents and my
sister for their patience and faithful support during my Ph.D. program. Finally, I thank
God, my grandmother Monica, and Olivia who are always in my mind, and who gave
me strong reasons to complete this thesis.
Table of Contents
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Thesis Statement
1.2 Demonstrating Thesis Statement
1.3 Structure of the Dissertation
2 Performance Metrics
2.0.1 Attacker Models
2.0.2 Password Strength
2.1 Recall
2.2 Reuse
3 Understanding Password Habits and Their Causes
3.1 Introduction
3.2 Background and Related Work
3.2.1 Password Reuse
3.2.2 Password Structures
3.2.3 User’s Perception on Passwords
3.2.4 Causes of Weak Passwords
3.3 Methodology
3.3.1 Detecting Sites With Participant Accounts
3.3.2 Password Extraction
3.3.3 Semantic Transformation
3.4 Results
3.4.1 Statistical Tests
3.4.2 Limitations and External Validity
3.4.3 Statistics
3.4.4 Reuse
3.4.5 Why Is Reuse Prevalent?
3.4.6 Password Strength
3.4.7 Why Are Passwords Weak?
3.5 Recommendations
3.6 Conclusions
4 Life Experience Passwords (LEPs)
4.1 Introduction
4.2 Related Work
4.3 LEP Design
4.3.1 Topics and Facts
4.3.2 Strength
4.3.3 Creation
4.3.4 Authentication
4.3.5 Uses of LEPs
4.4 User Studies
4.5 Results
4.5.1 Participant Statistics
4.5.2 LEPs Are Memorable and Secure
4.5.3 LEPs Are Strong Against Guessing
4.6 Conclusions
5 Mnemonics Passphrase (MNPass)
5.1 Introduction
5.2 Related Work
5.3 Mnemonics and Passphrases
5.3.1 Passphrases
5.3.2 Mnemonics
5.3.3 Using Mnemonics
5.4 Attacker Models and Strength
5.4.1 Attacker Models
5.4.2 Strength Against Attacks
5.5 Passphrase Models
5.5.1 Baseline: User-Chosen Passphrases
5.5.2 Improve Recall: Authentication Hints
5.5.3 Improve Security: Mnemonic-Guided Passphrase Creation
5.5.4 Improve Security: System-Chosen Passphrases
5.6 Evaluation
5.6.1 User Studies
5.6.2 Mnemonics Improve Recall
5.6.3 Mnemonics Improve Security Against Phrase-Dictionary Attacks
5.6.4 Users Like Mnemonics
5.7 Conclusions
6 Semantically-Guided Password Generation (GuidedPass)
6.1 Introduction
6.2 Background and Related Work
6.3 Password Dataset
6.3.1 Dataset Statistics
6.3.2 Password Composition
6.3.3 Semantic Patterns
6.3.4 How to make passwords stronger
6.4 Designing Suggestions
6.4.1 Suggestions
6.5 Experiment
6.5.1 Approaches
6.5.2 User Study Design
6.5.3 Statistical Tests
6.5.4 Limitations and Ecological Validity
6.6 Results
6.6.1 Participant Statistics
6.6.2 Password Statistics
6.6.3 Recall after two days
6.6.4 Creation Time
6.6.5 Strength
6.6.6 Pattern Analysis
6.6.7 User Sentiment
6.7 Conclusions
7 Conclusion
Bibliography
List of Tables
2.1 Strength metric used in each approach
2.2 Recall intervals used in different user studies
2.3 Definition of similarity metric
3.1 Survey instruments we used for pre- and post-study
3.2 Survey instruments we used for risk perception pre-study
3.3 Patterns in an e-mail message that indicate that a user may have an account on the sender’s site.
3.4 Account types in my study
3.5 Login success rates per account
3.6 Password reuse: percentage of participants that reuse in a given way.
3.7 Password reuse: percentage of participants that reuse in a given way.
4.1 LEP topics
4.2 Fact categories and their statistical and brute-force strengths (see Table 4.3 for sources)
4.3 Popular lists, their sizes, and sources
4.4 Participant Statistics and Results of My Studies
4.5 Fact recall and guess success per category
5.1 Passphrase models considered in my work.
5.2 Basic statistics on passphrases per passphrase model
5.3 User recall three days and seven days after passphrase creation
5.4 Passphrase strength against LM attacker per passphrase model. For LM attacker, I show the estimated strength guess number.
5.5 Passphrase strength against phrase-dictionary attacker per passphrase model. For the phrase-dictionary attacker, I show the percentage of passphrases that have a given number of words overlapping with a common phrase.
6.1 Memorable password dataset, categorized into three different strength groups
6.2 Percentage of 3class8 passwords and password length statistics
6.3 Average, median and std. of number of symbols, digits and uppercase letters and number of class changes for passwords in each strength type
6.4 Percentage of unique semantic pattern
6.5 Top 10 semantic patterns of weak and med. strength group
6.6 Top 10 semantic patterns of strong strength group
6.7 Number of the semantic tags in each password type
6.8 Comparisons of generated suggestions
6.9 Password creation approaches
6.10 Total number of participants who created and authenticated
6.11 Average, median and STD of password length
6.12 Recall rate after 2 days.
6.13 Average password creation time
6.14 Median guess number, measured using 2-gram, 3-gram and Monte Carlo back-off. The last column shows the median number of guesses among the lowest guess values produced by the three methods.
6.15 Overall suggestions employed by users
6.16 Average ratings of “this approach was easy to use” claim on Likert 1–10 scale, with 10 being the strongest agreement and 1 being the strongest disagreement.
6.17 Survey responses
List of Figures
3.1 User study flow
3.2 User ratings for PS3–PS6 questions.
3.3 User ratings for R1–R20 questions.
3.4 Number of accounts per participant as estimated by the participant (subjective), and as measured in my study (objective).
3.5 Password reuse patterns per user: 4 users reuse passwords that are of similar strength, 3 reuse only stronger passwords, 17 only weaker passwords and 25 reuse both stronger and weaker passwords.
3.6 Distribution of password strength for important and non-important sites
3.7 Intended and actual password strategies, shown jointly with password length and strength.
4.1 Security questions versus LEP example for “high school” topic.
4.2 LEP creation and authentication
4.3 LEP input methods
5.1 Different passphrase creation and auth. methods using hint- and guide-mnemonics
6.1 Empirical CDF of password length
6.2 Empirical PDF of password length
6.3 Empirical CDF of the number of semantic tags
6.4 Empirical PDF of the number of semantic tags
6.5 Password strength with 2-gram model
6.6 Password strength with 3-gram model
6.7 Password strength with back-off model
6.8 Optimal attacker chooses min from (2-gram, 3-gram, and back-off)
6.9 Empirical CDF of strength improvement between the initial and the final password.
Abstract
Textual passwords are widely used for user authentication, but they are often difficult
for a user to recall, easily cracked by automated programs, and heavily reused. Weak or
reused passwords are responsible for many contemporary security breaches. Hence, it
is critical to study both how users choose and reuse passwords, and the reasons that they
adopt unsafe practices.
In this thesis I first examine the reasons why people create weak passwords and
reuse them over multiple accounts. My research complements the body of existing
work by studying the semantic structure, strength and reuse of real passwords, as well
as conscious and unconscious causes of unsafe practices. To do this, I used a test group
population of 50 participants. Significant reuse and weak passwords clearly demon-
strate the need for alternative authentication methods that are more memorable, secure,
and less reused. My next three key thesis topics focus on developing novel authenti-
cation mechanisms that can directly improve current approaches. The first topic, life-
experience passwords (LEPs), uses a person’s prior life experience as information to
generate more memorable and secure authentication questions. We show that LEPs
significantly raise the level of memorability and security compared to existing pass-
words and security questions. My second topic constructs more memorable and more
secure passphrases through the novel use of mnemonics – multi-letter abbreviations
of passphrases (MNPass) composed of the first letters of each word in a passphrase.
I apply mnemonics when generating and authenticating passphrases and show that the
mnemonics-based approach improves recall compared to randomly generated passphrases,
and enhances strength compared to user-selected passphrases. My third topic explores
password creation with semantic feedback (GuidedPass). I analyze user-input passwords
and provide real-time, specific suggestions for improvement based on their existing
semantic structure. A GuidedPass password is 10^7 times stronger than a user’s initial
passwords and has good recall. GuidedPass passwords are also 100 times stronger and
have a 20% higher recall than passwords created with only password-meter feedback.
Chapter 1
Introduction
Textual secrets, such as passwords or passphrases, are widely used for user authentication.
But they also suffer from many problems. Users create weak passwords or use
predictable patterns that make guessing attacks easy [RJK13a]. Users also reuse passwords
at multiple servers; thus, attackers who compromise one server may gain access
to many others with the same credentials [Blo16a, DBC+14a]. Approaches that aim to
help users create stronger passwords, such as strict password composition policies and
password meters, often cause frustration [IS10], lower memorability, and do not greatly
improve strength [KSK+11, UKK+12, UAA+17, JYW+15].
My work takes a different approach, observing that requirements of user authentication
for complex and unique secrets are at direct odds with human cognitive abilities
to create and accurately recall such secrets. First, I conducted a study of users’ password
habits, investigating how users balance the trade-off between memorability and
security. Existing literature on this topic is limited as it usually studies bad password
habits and their causes in isolation, which makes it hard to evaluate real user trade-offs.
My work complements these existing efforts. I use information from the semantic study
of real passwords to study strength and reuse of passwords, and conscious and unconscious
causes of unsafe practices, in a population of 50 participants. The participants
took part in a carefully designed, IRB-approved lab study, where I collected semantic
representations of the subjects’ existing online credentials, and interviewed them about
their password strategies and risk perceptions. My findings show that users often have
trouble keeping accurate track of their password reuse, and tend to underestimate it.
Further, they understand how to compose strong passwords, but make them too short, and
they have mental models of attackers that are too weak. In a large number of cases users
resort to bad practices (weak passwords, verbatim reuse) driven by a desire to maximize
recall. I provide recommendations for tools that inform users of their password practices
and offer automated remediation.
After learning that memorability plays a crucial role in how users choose pass-
words, I developed three pieces of work that seek to preserve inherent memorability
in user-chosen passwords and passphrases, while improving their security. In the first
work, I proposed life-experience passwords (LEPs) – a new password design where
the authentication secret consists of several facts about a user-chosen past experience,
such as a trip, a graduation, a wedding, etc. At LEP creation, the system extracts these
facts from the user’s input and transforms them into questions and answers. At authen-
tication, the system prompts the user with questions and matches her answers to the
stored ones. I proposed two LEP designs, and evaluated them via user studies. I further
compared LEPs to passwords, and found that: 1) LEPs are up to 30–47 bits stronger
than an ideal, randomized, 8-character password, 2) LEPs are up to 3 times more memorable,
and 3) LEPs are reused half as often as passwords. While both LEPs and security
questions use personal experiences for authentication, LEPs use several questions which
are closely tailored to each user. This increases LEP security against guessing attacks.
In our evaluation, only 0.7% of LEPs were guessed by casual friends and 9.5% by fam-
ily/close friends, while prior research found that friends could guess 17–25% of security
questions. On the downside, LEPs take 6.7 times longer to create, and 3.3 times longer
to authenticate, and thus are not well-suited for everyday authentication tasks.
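As a rough point of reference for the bit comparison above, the brute-force strength of an ideal, randomized 8-character password can be computed directly. The sketch below assumes an alphabet of the 95 printable ASCII characters; that alphabet size is an assumption made here for illustration and is not stated in this chapter, and the function name is hypothetical.

```python
import math


def brute_force_bits(alphabet_size: int, length: int) -> float:
    """Bits of brute-force strength for a uniformly random string.

    Each character contributes log2(alphabet_size) bits when every
    character is chosen independently and uniformly at random.
    """
    return length * math.log2(alphabet_size)


# Assumption: 95 printable ASCII characters, 8 characters long.
baseline = brute_force_bits(95, 8)
print(f"ideal 8-character password: {baseline:.1f} bits")

# A secret that is 30 to 47 bits stronger than this baseline.
print(f"30 to 47 bits stronger: {baseline + 30:.1f} to {baseline + 47:.1f} bits")
```

Under a different alphabet assumption the baseline shifts, but the arithmetic stays the same.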
In the second work, I improved memorability and security of passphrases
(MNPass). Passphrases are regarded as more secure than passwords because they are
longer than passwords. Yet, users use predictable word patterns and common phrases
to make passphrases memorable, which in turn significantly lowers security. I explored
a novel use of mnemonics, multi-letter passphrase abbreviations, to make passphrases
more memorable and more secure. I use mnemonics during authentication as user hints
to achieve cued-recall. I also explored the use of mnemonics to guide passphrase
creation – we generate a random mnemonic and require a user to produce a passphrase that
matches it. This guides the users away from common phrases and improves security. I
evaluated these uses of mnemonics in several IRB-approved user studies with participants
from Amazon Mechanical Turk. I find that mnemonics displayed as authentication
hints increase recall of passphrases by 30–36% after three days, and by 51–74% after
seven days. When used to guide passphrase creation, mnemonics reduce the
use of common phrases from 52% to under 5%, while passphrase recall remains high.
Users also rate usability of passphrases with mnemonics (for creation or for authentication)
higher than usability of classical passphrases.
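The two uses of mnemonics described above can be sketched in a few lines. This is a minimal illustration and not the MNPass implementation: the function names are hypothetical, the mnemonic is taken to be the lowercase first letter of each whitespace-separated word, and drawing the letters of a random mnemonic uniformly from a to z is an assumption made here.

```python
import secrets
import string


def mnemonic_of(passphrase: str) -> str:
    """Abbreviate a passphrase to the first letter of each word."""
    return "".join(word[0].lower() for word in passphrase.split())


def random_mnemonic(num_words: int) -> str:
    """Generate a random mnemonic (assumption: uniform letters a-z)."""
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(num_words))


def matches_mnemonic(passphrase: str, mnemonic: str) -> bool:
    """Check that a user-written passphrase spells out the given mnemonic."""
    return mnemonic_of(passphrase) == mnemonic.lower()


# Hint use: the mnemonic of an existing passphrase is shown at login.
hint = mnemonic_of("Purple elephants dance at midnight")

# Guide use: the system picks a mnemonic first, and the user must then
# write a passphrase whose words start with those letters.
guide = random_mnemonic(5)

print(hint, guide, matches_mnemonic("tiny green frogs", "tgf"))
```

In the guide use, a creation form would call `matches_mnemonic` on the user's candidate passphrase and reject it until the first letters agree with the generated mnemonic.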
In my third piece of work, I improved security of passwords (GuidedPass) by
providing real-time suggestions during password creation. Current password feedback,
such as meters, does not provide actionable information to users about what should be
modified and how to improve password strength. I first analyzed a dataset of 3,260
passwords, which I collected in prior stages of work, to identify those that were
memorable (successfully recalled by users). I then investigated their strength and semantic
structure to identify those patterns that lead to highly secure passwords. I used this
information to devise specific structural suggestions to users during password creation that
would guide them towards less common structures and content. I tested my approach
versus the new NIST suggestions, CMU suggestions described in [UAA+17], and a
password-meter-only feedback. My approach improves the strength of initial user-input
passwords 10^7 times, while preserving high memorability. It further outperforms CMU
suggestions, producing passwords that are 10 percentage points more memorable (80%
versus 70%), because it better preserves user-intended password structure. Finally, my
approach outperforms new NIST and meter-only approaches, both in memorability and in security.
1.1 Thesis Statement
In the presence of clear tension between memorability and security of existing
authentication mechanisms, I provide new mechanisms and proposed improvements to existing
authentication mechanisms that lead to authentication secrets that are memorable, secure,
and also highly usable.
1.2 Demonstrating Thesis Statement
This thesis asserts that it is possible to better balance memorability and security
requirements for user authentication. I demonstrate this through four pieces of work. The first is
the study of user password habits, which highlights and quantifies the trade-off between
memorability and security. The second, LEPs, creates new authentication secrets that
are more memorable and secure than passwords. The third, MNPass, uses mnemonics
to improve memorability and security of passphrases. The fourth, GuidedPass,
proposes semantically guided password generation, which helps users create more secure
passwords from their initial input, while preserving high memorability.
1.3 Structure of the Dissertation
This dissertation is organized as follows: Chapter 2 discusses how I measure
performance of different authentication approaches, namely the strength of authentication
secrets, their recall and their reuse. Chapter 3 provides the details of the user study
to better understand users’ password habits. I also present the findings on users’
password creation strategy and password versioning, and reuse data collected from their
real accounts. Chapter 4 presents the design of LEPs and their performance,
compared to passwords and security questions. Chapter 5 describes the use of mnemonics
to help users create memorable yet strong passphrases. I compare the performance
of this approach with system-generated and user-created passphrases, and find that
my approach has the best of both worlds: its memorability is comparable to that of
user-chosen passphrases, and its strength is comparable to that of system-generated
passphrases. Chapter 6 analyzes characteristics of memorable and secure passwords and
proposes a system that provides proactive and constructive real-time feedback to users
during password generation. This guidance greatly improves password strength, while
preserving high memorability. I summarize my findings and discuss lessons learned and
directions for future work in Chapter 7.
Chapter 2
Performance Metrics
This chapter will provide details on how I measure performance of different
authentication approaches.
2.0.1 Attacker Models
To discuss how to measure strength, we first have to consider the different attacker
models. In regard to how an attacker chooses password guesses, there are two methods:
brute-force and statistical guessing. A brute-force attacker will try every possible
password (given a password policy) in no particular order. A statistical-guessing attacker
will have some notion as to which passwords (or password segments) people use more
than others, and will try password guesses from most to least probable. Often, there
is no comprehensive way to completely order candidate passwords. Further, in some
populations some passwords may be more popular than others, and the attacker does not
know this a priori. Thus, any statistical guessing algorithm is necessarily a heuristic,
and two different algorithms may end up having two very different numbers of guesses
until they successfully guess a given password.
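The difference between the two guessing methods can be illustrated with a small sketch. The function names and the tiny list of observed passwords are hypothetical, and ranking candidates by raw frequency is only one possible heuristic for the statistical-guessing attacker.

```python
from collections import Counter
from typing import Optional, Sequence


def brute_force_expected_guesses(alphabet_size: int, length: int) -> float:
    """Expected guesses when every candidate is tried in arbitrary order.

    With N equally likely positions for the target in the ordering,
    the expected position is (N + 1) / 2.
    """
    space = alphabet_size ** length
    return (space + 1) / 2


def statistical_guess_number(target: str, observed: Sequence[str]) -> Optional[int]:
    """Guess number for an attacker who tries the most frequent passwords first.

    Returns None when the target never appears in the attacker's data,
    since this simple heuristic has no way to rank it.
    """
    ranked = [password for password, _ in Counter(observed).most_common()]
    return ranked.index(target) + 1 if target in ranked else None


# Hypothetical data the attacker has seen before.
observed = ["123456"] * 5 + ["password"] * 3 + ["qwerty"]

print(brute_force_expected_guesses(26, 8))
print(statistical_guess_number("password", observed))
```

A different heuristic, or different training data, would rank the same password differently, which is why two statistical algorithms can report very different guess numbers.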
In regard to how many guesses an attacker can make, there are online and offline
attacks. In online attacks, the attacker makes some number of guesses using the login
dialogue at the server. After some number of unsuccessful guesses, the server will
usually lock down the user’s account. In offline attacks, the attacker has compromised
the server and has retrieved one-way hashes of all passwords. He now tries to guess
inputs to the one-way hash so that the output matches the password hashes that he has.
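The contrast between the two settings can be sketched as a toy simulation. Everything here is an assumption made for illustration: the class and function names are hypothetical, the lockout threshold of three failures is arbitrary, and unsalted SHA-256 stands in for whatever one-way hash a real server would use.

```python
import hashlib
from typing import Iterable, Optional


def one_way_hash(password: str) -> str:
    # Assumption: unsalted SHA-256 keeps the sketch short.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


class OnlineServer:
    """Login dialogue that locks the account after repeated failures."""

    def __init__(self, password: str, max_failures: int = 3) -> None:
        self._stored_hash = one_way_hash(password)
        self._max_failures = max_failures
        self._failures = 0

    @property
    def locked(self) -> bool:
        return self._failures >= self._max_failures

    def login(self, guess: str) -> bool:
        if self.locked:
            return False  # every further guess is refused
        if one_way_hash(guess) == self._stored_hash:
            self._failures = 0
            return True
        self._failures += 1
        return False


def offline_guesses_needed(stolen_hash: str, candidates: Iterable[str]) -> Optional[int]:
    """Count guesses until a candidate hashes to the stolen value."""
    for count, guess in enumerate(candidates, start=1):
        if one_way_hash(guess) == stolen_hash:
            return count
    return None


candidates = ["123456", "password", "qwerty", "letmein", "dragon"]

server = OnlineServer("dragon")
online_results = [server.login(guess) for guess in candidates]

offline_result = offline_guesses_needed(one_way_hash("dragon"), candidates)
print(online_results, server.locked, offline_result)
```

In this toy run the online attacker is locked out before reaching the correct candidate, while the offline attacker is limited only by the length of the candidate list.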
In both online and offline attacks, the attacker may be going after a specific user’s
account, or after any account. We will assume the first; i.e., we will say a user is
protected if her account’s password is strong even if another user’s password can be cracked
by the attacker on the same server.
2.0.2 Password Strength
The strength of a password is the expected number of guesses a given attacker would
take until achieving success; this is also known as a “guess number.”
The Shannon entropy [Sha01] has been used to measure uncertainty of information or the
possible size of password space [Gui06]. Although the Shannon entropy can estimate
the strength of a password against brute-force attacks, it fails to accurately model
resistance to statistical guessing attacks [Mas94]. Massey [Mas94] was the first to show that
the average number of guesses required with an optimum strategy until one correctly
guesses the value of a discrete random variable is grossly overestimated by the Shannon
entropy. Instead, researchers have proposed different ways to model strength against the
statistical-guessing attacker [Bon12a, KKM+12, JYW+15, Pli].
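The two quantities discussed above can be computed side by side for a known distribution. The sketch below is illustrative only: the function names and the two example distributions are hypothetical. The expected number of guesses under the optimum strategy is obtained by trying values from most to least probable, so a value at rank i costs i guesses.

```python
import math
from typing import Sequence


def shannon_entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_guesses(probs: Sequence[float]) -> float:
    """Average guesses for an attacker who tries values in descending probability."""
    ordered = sorted(probs, reverse=True)
    return sum(rank * p for rank, p in enumerate(ordered, start=1))


# Hypothetical distributions over eight candidate passwords.
uniform = [1 / 8] * 8
skewed = [0.5] + [0.5 / 7] * 7

for name, dist in (("uniform", uniform), ("skewed", skewed)):
    print(name, shannon_entropy_bits(dist), expected_guesses(dist))
```

The two functions summarize the same distribution in different ways, which is why a strength metric based on one cannot simply be substituted for the other.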
In my work, wherever possible, I used the Monte-Carlo based approach by
Dell’Amico and Filippone [DF15] to calculate strength against statistical-guessing
attacks. This approach estimates the number of guesses until success using a sampling
method over a probabilistic password model. They have shown the accuracy of such
estimated strength against state-of-the-art attacks. To calculate passphrase strength
in MNPass, we used the Monte-Carlo method trained on large phrase and sentence
corpora instead of on passwords.
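The sampling idea behind this kind of estimate can be sketched with a toy model. The assumptions are stated loudly: the model below is a hypothetical explicit list of 1,000 passwords with Zipf-like probabilities, whereas a real probabilistic password model would be trained on data; the function name is invented for this sketch; and the estimator used, which sums 1/(n * p) over sampled passwords more probable than the target, is written here from a general understanding of the Monte-Carlo approach rather than copied from this dissertation.

```python
import random
from typing import Sequence


def monte_carlo_guess_number(target_prob: float, sample_probs: Sequence[float]) -> float:
    """Estimate how many passwords are more probable than the target.

    sample_probs holds the model probability of each password in a sample
    drawn from the model itself. Each sampled password with probability
    p > target_prob contributes 1 / (n * p) to the estimate.
    """
    n = len(sample_probs)
    return sum(1.0 / (n * p) for p in sample_probs if p > target_prob)


# Toy model (assumption): password at rank k has probability proportional to 1/k.
random.seed(0)
num_passwords = 1000
weights = [1.0 / rank for rank in range(1, num_passwords + 1)]
total = sum(weights)
model = [w / total for w in weights]

# Draw passwords from the model and record each one's probability.
sample = random.choices(model, weights=model, k=20000)

# The password at rank 100 has exactly 99 more probable passwords ahead of it.
target_rank = 100
estimate = monte_carlo_guess_number(model[target_rank - 1], sample)
print(target_rank - 1, round(estimate, 1))
```

Because the toy model is small enough to enumerate, the estimate can be compared against the exact count; for a real model the password space is far too large to enumerate, which is the situation a sampling estimate is meant for.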
In some parts of my work the probability distribution across passwords was
unknown. For example, in our study of user password habits we did not have access
to all the users’ actual passwords at the same time, because this would be against our
privacy goals. This prevented us from deriving a probability distribution across pass-
words. In these cases, I used a statistical guess estimate from zxcvbn on each original
password (before it was deleted due to study procedures).
Finally, LEP work took place before the Monte-Carlo method for strength estimation
was published. For this reason, I used the PARS tool by Ji et al. [JYW+15] to estimate
statistical strength of passwords. I used my own method to estimate probabilities of
LEPs and their statistical strength. Further, in some parts of my work I report both the
brute-force strength and the statistical-guessing strength.
Table 2.1 shows the strength metrics that were used for different
aspects of the research.
| Approach | Strength metric |
|---|---|
| Study of users’ password habits | Statistical guessing strength via zxcvbn |
| LEPs | Brute-force strength and statistical guessing strength via PARS tool (passwords); statistical guessing strength via my own method (LEPs) |
| MNPass | Statistical guessing strength via Monte-Carlo method |
| GuidedPass | Statistical guessing strength via Monte-Carlo method |

Table 2.1: Strength metric used in each approach
2.1 Recall
I measured the successful recall rate after a participant’s authentication secret creation.
Ideally, we would measure recall at different intervals (some short, some longer) to
accurately estimate recall of any password. However, there are some challenges to this
approach in the online user study, where attrition rate is high. I generally observed that
only 60% to 70% of participants return from one task to the other. I also observed that
the participants who do not recall a password in one authentication task will not return
to the next because they feel they cannot perform the task – even though I promised
payment regardless of authentication success.
To cope with these issues, I performed the recall task only once for GuidedPass,
two days after creation; this is a commonly used interval for authentication
tasks [UAA+17, KSC+, SKK+12], and it lets me compare the recall in my approach
with competing approaches. In the MNPass work, I performed an additional five-day
recall task to evaluate longer-term recall performance. To make a fair comparison,
I only report results from participants who completed both authentication tasks. In
the LEP work, I used two different recall intervals: 1 week and 3-6 months. The first
authentication visit was used to estimate short-term recall. I ran the second authentica-
tion task months later to estimate long-term recall. All recall measurement approaches
that I used are summarized in Table 2.2.
I also implemented detection of copy/paste actions, and I excluded from each study any
participant who used them, for whatever reason. While other
researchers [UAA+17, KSC+, SKK+12] allow users to store or copy passwords, I believe
this defeats the purpose of researching memorable and usable passwords. Hence, I dis-
couraged users from using copy or paste. I also reminded the participants multiple
times to rely on their memory only. If any cheating occurred, it was likely to affect all
the approaches that we evaluated equally, and therefore should not change the relative
order of their performance.
| Approach | Recall interval after creation | No. of trials | Cheating prevention |
|---|---|---|---|
| LEPs | 1 week & 3-6 months | min. 3 trials | yes |
| MNPass | 2 days & 5 days | 5 trials | yes |
| GuidedPass | 2 days | 5 trials | yes |

Table 2.2: Recall intervals used in different user studies
2.2 Reuse
In LEPs I investigated how many of a user’s passwords were sufficiently similar to each
other. However, LEPs are authenticated based on fact matches, and passwords are based
on the exact string match. Thus, it is hard to devise a similarity measure that applies
equally well to both concepts.
To define the similarity of two passwords, I used the approach encoded in the Linux
Pluggable Authentication Modules (PAM) [PAM]. I decide that two passwords p1 and
p2 are similar if more than 1/2 of items in p2 also appear in p1. For passwords, items
are characters and for LEPs, items are facts. This definition allows me to directly apply
pam_cracklib to detect similar passwords. I say that op1 is similar to op2 if at least
one of the following conditions holds: 1) more than 1/2 of op1’s characters appear in
op2; 2) op1 is a palindrome of op2; 3) op1 is op2 rotated; 4) op1 differs from op2 only
in case.
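The four conditions can be sketched as follows. This is a simplified reading of the rules stated above, not pam_cracklib’s actual code; the function name `is_similar` is hypothetical, and “palindrome of op2” is interpreted here as op2 reversed, which is an assumption.

```python
def is_similar(op1: str, op2: str) -> bool:
    """Return True if op1 is similar to op2 under any of the four conditions."""
    if not op1 or not op2:
        return False
    # 1) More than half of op1's characters appear in op2.
    shared = sum(1 for ch in op1 if ch in op2)
    if shared > len(op1) / 2:
        return True
    # 2) op1 is op2 reversed (assumed meaning of "palindrome of op2").
    if op1 == op2[::-1]:
        return True
    # 3) op1 is a rotation of op2.
    if len(op1) == len(op2) and op1 in op2 + op2:
        return True
    # 4) op1 differs from op2 only in case.
    return op1.lower() == op2.lower()
```

For LEPs, the same check on condition 1 would be applied to facts rather than characters.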
In the study of users’ password habits with users’ real passwords, I refine the pass-
word similarity metric to better understand versioning of passwords from the same user.
I say that two passwords are similar if they have at least one common segment (three
or more characters long) and at least one different segment. This allows me to better
understand the underlying structure of passwords. Table 2.3 summarizes all the similar-
ity metrics I used.
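A minimal sketch of the segment-based definition, assuming each password has already been split into segments (the function name `is_versioned` and the list-of-strings representation are hypothetical):

```python
def is_versioned(segments_a, segments_b, min_len=3):
    """Two passwords are similar if they share at least one segment of
    min_len or more characters and differ in at least one segment."""
    set_a, set_b = set(segments_a), set(segments_b)
    common = {s for s in set_a & set_b if len(s) >= min_len}
    different = set_a ^ set_b  # segments present in only one password
    return bool(common) and bool(different)
```

Under this sketch, verbatim reuse is not counted as versioning, because identical passwords have no differing segment.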
| Approach | Password/Passphrase reuse metric |
|---|---|
| Study of users’ password habits | At least one common segment (3 or more characters long) and at least one different segment |
| LEPs | More than half of answers overlap; more than half of characters overlap |

Table 2.3: Definition of similarity metric
Chapter 3
Understanding Password Habits and
Their Causes
3.1 Introduction
Current advice for password creation – to create a strong, unique password for every
online account – is unrealistic. Users have many online accounts, and human memory
is limited and ill-suited to remember many different, unrelated, and complex passwords.
When this policy is inevitably broken by users, what do users do instead? How do users
balance security, memorability, and convenience for their personal online accounts? A
perfect storm of vulnerability exists at the intersection of extensive password reuse, the
hundreds of millions of passwords breached at sites like Yahoo [Phi16], and the attacks
which exploit password reuse to gain access to sensitive data [Blo16b, CL16, Goo16a].
Researchers [UNB+15] have conducted lab-based user studies, where users are asked
to create passwords for artificial online accounts in episodic scenarios. However, investigating
these factors in a lab setting is difficult and limited in several ways: one must
both deeply analyze the real, everyday passwords of multiple users, as well as the atti-
tudes, strategies, and concerns of these users with respect to their security. To the best
of my knowledge, simultaneously studying both user actions regarding real-world pass-
word use and user attitudes has not been previously attempted. Performing an integrated
investigation of these components of password choice and use has the potential to create
a better understanding of the security behaviors that people currently follow, which can
in turn help researchers design password policies or alternative authentication methods
that are better aligned with user capabilities.
Many researchers have studied password choices and reuse in leaked
datasets [DBC+14b, WADMG09, RJK13b, VCT14b, Bon12b] or via browser plug-ins
[DC07, WRBW16]. While this data is valuable for showing wide-spread user
trends, it cannot be used to understand why users make unsafe choices, and thus
cannot inform interventions. Other researchers have studied password choices in a
lab setting [UNB+15, SB14], where participants created and narrated their choice of
passwords for fictional servers. But users may be biased towards security in these
artificial conditions, and may exhibit and describe behaviors that do not align with their
actual practices.
Open questions that we seek to address are:
1) How often are passwords reused in a slightly changed form (e.g., by adding a
number), and what are frequent methods for password versioning? What are the causes
of password reuse? Answers to these questions can help focus user education and inform
more realistic attack models.
2) Which factors (a user’s intent, risk perception, a site’s importance to the user, and
a site’s password policy) influence a user’s password choice? Which factors improve
password strength? Answers to these questions can help focus interventions and user
education.
3) How well do users understand their password practices? Does their practice align
with their intent? Answers to these questions can help create tools that improve user
understanding and help them make more informed choices.
Studying these open questions is hard, as it requires access to real passwords, along
with interviews with their owners. It is challenging to design a study that produces such
data without jeopardizing user safety.
My first contribution lies in the novel study methodology that collects information
about a user’s real password semantic structure and allows us to detect password ver-
sioning, while protecting user privacy. I ask users to log into sites where they already
have an account, and then explain password choices and answer questions about their
perceptions. My study infrastructure automatically extracts a password’s semantic struc-
ture, length and strength, and then transforms the original password using a consistent
but irreversible mapping of semantic segments. I then store the password’s features and
the transformed password, and discard the original password. These actions, together
with storing no user-identifying information, uniquely enable us to study password structure
and password versioning, along with other measures of password strength and reuse,
while keeping study participants safe. I describe my study design in detail in Section
3.3. The study was reviewed and approved by an Institutional Review Board (IRB).
My second contribution lies in the findings, which I present in Section 3.4.
Although significant technical and policy efforts have been made to improve user aware-
ness and enforce strong passwords, I find that: 1) users are not aware of accounts they
have or of their reuse habits; 2) reuse is prevalent – all users in my study reused pass-
words, 98% verbatim and 2% with slight versioning; 3) users intend to create strong
passwords, and know how to compose them, but do not understand how long pass-
words should be or the abilities of attackers; 4) users often knowingly resort to unsafe
practices because they value memorability over security. My findings also update or
challenge several prior results: 1) I find a median of 40 accounts per user, updating the
Florencio et al. [DC07] estimate of 25 from 2006; 2) I find no significant correlation
between password strength and a user’s risk perception, refuting findings by Creese et
al. in [CHJPW13]; 3) I find that weak and strong passwords are equally reused, in con-
trast to findings by Wash et al. in [WRBW16] that suggest that strong passwords are
reused more often.
My third contribution lies in the proposed user interventions, which are motivated
by my findings and presented in Section 3.5. These interventions aim to inform users of
their unsafe practices and to offer automated remediation.
3.2 Background and Related Work
In this section, I present prior research on password creation, reuse patterns, and user
behaviors and risk perceptions with regard to passwords.
3.2.1 Password Reuse
Florencio et al. [DC07] conducted a large scale password reuse study by instrumenting
Microsoft Windows Live Toolbar. The study included half a million users monitored
over a three month period. They found that each user had about 25 accounts and 6.5
passwords, each of which is shared across 3.9 sites. Their results showed that a large
number of weak passwords were heavily reused. This study was conducted in 2006,
thus the number of online accounts and user password creation and reuse behaviors may
have changed. My study provides an updated estimate of 40 online accounts per user,
and potentially 10 accounts per password.
Recently, Wash et al. [WRBW16] examined the types of passwords that are more
frequently reused. They developed a Web browser plugin to collect user passwords, and
conducted a user study with 134 participants. They found that strong and more frequently
used passwords were reused more often. I find an opposite effect in my dataset:
weak and strong passwords are reused comparably often. The study by Wash et al. cannot
capture details on password versioning and real password strength, nor collect information
about causes of password reuse. My study contributes these findings. Wash’s
study further discovers only those accounts that a user accesses frequently, while I also
discover rarely-accessed accounts, which may explain my different findings.
In a lab study, Ur et al. [UNB+15] examined password behaviors of 49 users, creating
accounts at three fictitious servers: a bank, an e-mail, and a news server. They found that
password reuse is common, and that users are not good at making value decisions about
their online accounts. Due to the fictitious nature of the accounts, Ur et al. [UNB+15] were
able to collect user passwords and examine them for versioning. They found that users
had serious misconceptions that making minor or incremental additions to dictionary
words would result in secure passwords. I validate their findings on my real-passwords
dataset. I further investigate other causes of weak and reused passwords in addition to
users’ misconceptions about password composition. I find that misconceptions about
password length and the desire for memorability are the main causes of weak passwords
among participants.
Das et al. [DBC+14b] examined how people reuse and mangle passwords based
on leaked password datasets. The limitation of their study is that 97.75% of the users
considered for password reuse had passwords in only 2 datasets. Thus they cannot
analyze reuse across many sites by a single participant, while my approach can. Their
study estimated that 43-51% of users reuse the same password across multiple sites,
while I find that 98% of my participants do so.
When users are required to adopt a new password composition policy for a university
account, Shay et al. [SKK+10] showed that more than half of participants reported that
they either modified an old password or reused it verbatim. Similarly, von Zezschwitz
et al. [VZDLH13] found through user interviews that 45% of users reuse passwords
verbatim, while 70% version them. My findings put the number of users that reuse
passwords at 98%. This number climbs to 100% if we include versioned passwords.
Florencio et al. [FHVO14b] explored a game-theoretical approach to manage a port-
folio of passwords, given a limited user memory. They outlined an optimal password-
sharing strategy as follows: (1) group strong passwords within accounts with high value
and low probability of compromise; and (2) group weak passwords within accounts of
low value and high probability of compromise. In this work, I adopted “important”
and “non-important” tags for these two categories of accounts, and analyzed user shar-
ing strategies. I found that users share both within and across important/non-important
categories, so they may not readily adopt the proposed strategy.
3.2.2 Password Structures
Several articles [RJK13b, nyt, Goo16b] have pointed out that people use predictable
password patterns when creating multiple passwords. Weir et al. [WADMG09] formally
modeled and analyzed passwords using probabilistic context-free grammars (PCFG).
Veras et al. [VCT14b] investigated the semantic patterns in the millions of cracked
passwords using WordNet, Part-of-Speech (POS) tagger, and other natural language
processing (NLP) tools. I use Veras et al.’s tool to semantically transform collected
passwords to detect and analyze password versioning by each participant.
Bonneau [Bon12b] demonstrated that many passwords are vulnerable to statistical
guessing due to use of dictionary words or popular password strings. Additionally, Shay
et al. [SKK+10] showed that nearly 80% of users based their password on a word or a
name, with special characters added at the beginning or at the end. While I similarly
observed that many users start with a dictionary word and make simple changes to it,
I focus on understanding per-participant habits, and do not study common passwords
across participants.
3.2.3 User’s Perception on Passwords
Creese et al. [CHJPW13] examined the relationship between perceptions of risk associated
with online tasks and password choice. They explored how individuals’ perceptions
of risk vary depending on whether the password user is a security expert or not,
and whether they have experienced some form of attack in the past. They found that
users whose opinion differs from experts’ opinion on 6 chosen questions tended to use
passwords with a smaller keyspace. The size of the keyspace was found to decline in
proportion with the magnitude of the opinions’ difference. In my work, I repeat their
approach to investigate the link between user risk perception and password strength, but
find no such correlation.
Ur et al. [UBS+16] investigated the relationship between users’ perceptions of the
strength of specific passwords and their actual strength. They found that participants
had serious misconceptions about using common phrases as a base for their passwords.
Also, they showed that there was large variance in participants’ understanding of how
passwords may be attacked. I confirm this tendency of users to underestimate attackers’
abilities.
3.2.4 Causes of Weak Passwords
Redmiles et al. [RMM16] investigated reasons for selective adoption of broad digital-
security advice by users. They identified four main reasons for user rejection of security
advice: 1) too much marketing information, 2) lack of risk, 3) over-saturation, and 4)
inconvenience. In this study, I found that inconvenience, or desire for memorability,
plays the most important role in password reuse.
Coventry et al. [CBBT14] argued that the general public does not generally follow best
practice because there is no clarity about required actions. My results support this argu-
ment. I find that participants generally understand how to compose strong passwords
and implement this in practice, but do not understand how password length influences
strength.
3.3 Methodology
In this Section, I describe my research goals and privacy protection goals, and how they
shaped my user study design.
Research Goals. My research goal was to study how users design passwords for
different sites, and how and why they reuse their passwords. This necessitated:
• Access to information about real passwords and real sites where they are used,
• Ability to detect passwords by the same user that are very similar to each other,
• Ability to summarize a user’s password practices and ask them to explain specific
  cases of weak or reused passwords.
The easiest way to collect this information would have been to ask each user to list all
their accounts and passwords. However, a user may forget where she has created an
account in the past. Hence, I wanted to mine information about potential user accounts
in some automated manner, to supplement a user’s memory.
Further, asking users to give us their passwords directly would have been a great
risk to the users’ privacy. These credentials could potentially be misused by someone
on my team, or they could be stolen by outside attackers who could have compromised
my study server. These considerations led us to formulate the privacy protection goals.
Statistics, pre-study

| Id | Question | Type |
|---|---|---|
| ST1 | How many online accounts do you think you have? | Numeric |
| ST2 | Roughly, how many different passwords do you think you have for all online accounts? | Numeric |
| ST3 | How many email accounts do you have? | Numeric |
| ST4 | Is the Gmail account you are using for this study your primary email account? | Yes/No |

Password strategy, pre-study

| Id | Question | Type |
|---|---|---|
| PS1 | How do you choose your passwords? Do you think this is a good practice? | Narrative |
| PS2 | Do you allow your browser to save passwords for you, or use a password manager? | Narrative |
| PS3 | I do not change my passwords, unless I have to. | Likert |
| PS4 | I use different passwords for different accounts that I have | Likert |
| PS5 | When I create a new online account, I try to use a password that goes beyond the site’s min requirements | Likert |
| PS6 | I include special characters in my password even if it is not required. | Likert |

Impact-reasoning, post-study

| Id | Question | Type |
|---|---|---|
| IR1 | If a stranger could impersonate me on this site this would bring me personal or financial harm | Likert |
| IR2 | If a friend/family could impersonate me on this site this would bring me personal or financial harm | Likert |
| IR3 | If the data from my account became public this would bring me personal or financial harm | Likert |

Password strategy reasoning, post-study

| Id | Question | Type |
|---|---|---|
| PR1 | What is the reason behind using the same password? | Narrative |
| PR2 | Are you concerned that an attacker may obtain your password on site A and use it to access site B? | Narrative |
| PR3 | If the passwords are not same, but similar, ask why did the user change the password? | Narrative |

Mental model of attackers, post-study

| Id | Question | Type |
|---|---|---|
| M1 | How many guesses could an attacker make in 1 minute? | Numeric |
| M2 | How would an attacker come up with guesses? | Narrative |
| M3 | How might an attacker guess a password with an unlimited number of trials? | Narrative |

Table 3.1: Survey instruments we used for pre- and post-study
Risk-perception, pre-study

| Id | Question | Type |
|---|---|---|
| R1 | Online banking is risky | Likert |
| R2 | Using Amazon to purchase items using a credit-card is risky | Likert |
| R3 | Sending credit card details over email is risky | Likert |
| R4 | Using eBay to purchase items using Paypal is risky | Likert |
| R5 | Using unsecured WiFi in a coffee shop is risky | Likert |
| R6 | Downloading and using pirated or cracked versions of software is risky | Likert |
| R7 | Leaving your car unlocked in city centre multi-story car park is risky | Likert |
| R8 | Using social networking sites (e.g. Facebook, LinkedIn) with open privacy settings is risky | Likert |
| R9 | Using social networking sites (e.g. Facebook, LinkedIn) with closed privacy settings is risky | Likert |
| R10 | Using photo sharing sites (e.g. Flickr, Instagram) is risky | Likert |
| R11 | Geotagging content in Twitter or “Checking-in” to a location on Facebook / Foursquare is risky | Likert |
| R12 | Opening an email from an unknown sender is risky | Likert |
| R13 | Leaving a credit card behind a bar to guarantee a tab is risky | Likert |
| R14 | Clicking on a link in an email from an unknown sender is risky | Likert |
| R15 | Using online dating services is risky | Likert |
| R16 | Flying from the UK to the US is risky | Likert |
| R17 | Using a cybercafe is risky | Likert |
| R18 | Not updating your operating system (e.g. Windows, Mac OS X) is risky | Likert |
| R19 | Not updating your web-browser (e.g. Internet Explorer, Firefox, Google Chrome) is risky | Likert |
| R20 | Not updating other applications (e.g. Adobe PDF reader, Microsoft Office / Word, iTunes) is risky | Likert |

Table 3.2: Survey instruments we used for risk perception pre-study
Privacy Protection Goals. I aimed to protect the privacy of the users whose password
habits I study in the following manner:
• No storing of any identifying information,
• No intentional (by us) or accidental (by my browsers) storing of real passwords.
Figure 3.1: User study flow
To satisfy both research and privacy protection goals, I designed my study as shown
in Figure 3.1. This study was reviewed and approved by my Institutional Review Board.
The study consists of the following steps, which are performed in my lab on my laptop,
in a Chrome incognito window. The window is opened for each study participant and it
is closed after the participant completes the study. This ensures that no login credentials
or sessions/cookies remain stored in the browser.
Pre-study surveys. First, a participant is asked to fill out the “Statistics”, “Password
Strategy” (Table 3.1) and “Risk Perception” (Table 3.2) surveys.
Compiling a list of websites. Next, the CloudSweeper tool [SK13] scans the
user’s Gmail account to compile a list of sites where the user may have an account.
CloudSweeper uses the OAuth2 protocol to access Gmail, thus a participant’s Gmail credentials
are not seen by my software. Regular expressions are used, as described in
Section 3.3.1, to identify account creation and password reset e-mails sent by online
sites to a user-supplied e-mail address. Then, each site’s URL is extracted from such
messages. While I could have asked users to provide a list of sites where they have an
account, I believed, and my results confirm this, that users may not be fully aware of all
the accounts they have, since they were created over a long time.
Collecting participant login information. Next, the list of site URLs is shown to
the participant. The participant can delete sites where she does not have an account,
or sites that are sensitive. The participant can also add to the list other sites where she
has an account. Next, the participant is asked to mark each site in the remaining pool
as important to her or not, and as the one she frequently visits or not. I provided no
guidance to users how to assign these tags, but I find that users generally marked sites as
important if they cared about security of their content at these sites (see Section 3.4.3).
Finally, the participant is asked to choose at least 8 important and 4 non-important
sites to log on to; I selected this blend of sites to include both those frequented by the
participant and those that she may have forgotten about. The participant could choose
to visit more sites.
I developed a Google Chrome extension to capture the password from each
login attempt; it is described in Section 3.3.2. Character-length information is collected
for the original password. I also note if there is capitalization or mangling in the orig-
inal password and if they are at the beginning, in the middle or at the end. I do not
store more detailed data about positions of such changes because I believed that this
would unduly increase privacy risk, while not bringing much research benefit. Finally,
I input the original password into my local installation of the zxcvbn [Whe16] strength
meter, and retrieve the resulting strength. I then transform the original password into its
semantic equivalent, as described in Section 3.3.3. Such a transformed password does not
expose any information about the original password beyond its semantic structure, e.g.,
noun+verb+number. I then store the transformed password and forget the original.
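The coarse features described above can be illustrated with a short sketch. The function names are hypothetical, and the rule that “beginning” means the first character and “end” means the last character is an assumption, since the exact boundaries are not stated. Strength estimation with zxcvbn is omitted here.

```python
def capitalization_positions(password):
    """Return coarse positions (beginning/middle/end) of uppercase letters."""
    positions = set()
    last = len(password) - 1
    for i, ch in enumerate(password):
        if ch.isupper():
            if i == 0:
                positions.add("beginning")
            elif i == last:
                positions.add("end")
            else:
                positions.add("middle")
    return positions

def extract_features(password):
    """Collect only coarse features; exact character positions are not kept."""
    return {
        "length": len(password),
        "capitalization": sorted(capitalization_positions(password)),
    }
```

Storing only the coarse bucket, rather than the index of each capital letter, mirrors the privacy consideration stated in the text.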
Post-study surveys. After the logins, each participant is asked to respond to ques-
tions M1–M3 from the “Mental Model of Attackers” survey (Table 3.1). Then, for each
site where the participant attempted to log on, I administer the “Impact Reasoning” survey,
which asks her to rate on a Likert 1–5 scale how affected she would be if a stranger or
a friend impersonated her on that site, or if her data from that site were made public.
Finally, the “Password Reasoning” survey is provided, which asks the participant to
narrate the reasons for each reused or versioned password.
Pattern
welcome to
reset password
thank you/thanks for registering/creating/signing
your .* account has been created
Table 3.3: Patterns in an e-mail message that indicate that a user may have an account
on the sender’s site.
I now provide more details about my detection of user accounts from emails, record-
ing the login attempts and semantic transformation of passwords.
3.3.1 Detecting Sites With Participant Accounts
The CloudSweeper tool [SK13] extracts emails that contain a specific pattern by connecting
to Gmail’s IMAP using OAuth tokens. I leverage this functionality to identify emails
that match certain patterns relating to new account registration and password reset
e-mails. I identified nine patterns commonly used by websites in “new registration” and
“password reset” emails, shown in Table 3.3. I encode these as regular expressions and
use Gmail’s “X-GM-RAW” IMAP extension to search for them in the e-mail body.
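As an illustration, the patterns listed in Table 3.3 might be encoded as regular expressions along the following lines. This is a sketch only: it matches text locally and does not reproduce the “X-GM-RAW” search; the names `ACCOUNT_PATTERNS` and `looks_like_account_email`, the case-insensitive matching, and the way the slash-separated alternatives are expanded are all assumptions.

```python
import re

# Patterns from Table 3.3, expanded into regular expressions (assumed form).
ACCOUNT_PATTERNS = [
    re.compile(r"welcome to", re.IGNORECASE),
    re.compile(r"reset password", re.IGNORECASE),
    re.compile(r"(thank you|thanks) for (registering|creating|signing)", re.IGNORECASE),
    re.compile(r"your .* account has been created", re.IGNORECASE),
]

def looks_like_account_email(body: str) -> bool:
    """Return True if the e-mail body matches any account-related pattern."""
    return any(pattern.search(body) for pattern in ACCOUNT_PATTERNS)
```

A match only marks the e-mail as a candidate; the filtering steps that follow are still needed to reduce false positives.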
The identified e-mails are further processed to reduce the likelihood of falsely iden-
tifying a site as having the participant’s account. First, I filter out e-mails that have more
than one recipient in “To” and “CC” fields, because account-related e-mails are sent only
to the account owner. Second, I use regular expressions on the email subject and body
to: (1) filter out e-mails where the URL in the body of the e-mail does not match the
domain name of the e-mail’s sender, and (2) filter out e-mails where the string following
the “Welcome to” does not match the domain name of the e-mail’s sender. In each of
the match checks, I use Python’s SequenceMatcher [Pyt16], which calculates similarity
between two strings. If the SequenceMatcher returns a similarity ratio of 0.5 or higher,
I say that the given e-mail has passed the check. Due to lack of ground truth, I cannot
measure how accurate my tool is in identifying user accounts from e-mails. However, I
worked closely with a small group of pilot users to refine and improve this tool, using
manual verification and user feedback, before I ran my user study.
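The match check can be sketched with `difflib.SequenceMatcher` and the 0.5 threshold mentioned above. The function name `passes_check` and the lowercasing of both strings before comparison are assumptions made for this illustration.

```python
from difflib import SequenceMatcher

def passes_check(candidate: str, sender_domain: str, threshold: float = 0.5) -> bool:
    """Return True if the two strings have a similarity ratio at or above threshold."""
    ratio = SequenceMatcher(None, candidate.lower(), sender_domain.lower()).ratio()
    return ratio >= threshold
```

Here `candidate` would be either the domain of a URL found in the body or the string following “Welcome to”, compared against the sender’s domain name.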
3.3.2 Password Extraction
The Google Chrome extension is developed to extract the password from login attempts,
while preserving users’ privacy. The extension is enabled manually during each study
instance. During each login attempt, on each key press on the password field, we capture
the user input. I detect login events using the JavaScript window object’s onbeforeunload
event [W3S], and this triggers sending of the last captured input for semantic transfor-
mation.
The extension is also responsible for recording successful attempts. I record attempts
as successful if, after the login event is triggered, the following page that loads does not
have a password input field. I found this approach reliable for a wide range of websites,
though some sites have a password input field on their dashboard after a successful login
(e.g., the GNU Mailman dashboard). For such sites, I recorded the successful attempt
manually in the database.
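The success heuristic can be illustrated offline with a short sketch. The extension itself runs as JavaScript in the browser; the Python version below only demonstrates the rule “the next page has no password input field”, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class PasswordFieldFinder(HTMLParser):
    """Records whether the parsed page contains an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True

def login_succeeded(next_page_html: str) -> bool:
    """Treat the login as successful if the following page has no password field."""
    finder = PasswordFieldFinder()
    finder.feed(next_page_html)
    return not finder.found
```

As noted above, this rule misclassifies sites that show a password field after login, which is why such cases were recorded manually.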
3.3.3 Semantic Transformation
The extracted password is sent for semantic transformation to an application running
on the same laptop as the Chrome browser. The application was my modified version
of a tool originally developed by Veras et al. [VCT14b] for semantic segmentation and
part-of-speech (POS) tagging of strings. I first remove any mangling, before feeding the
password into the semantic segmentation and tagging tool. This is done by following
the KoreLogic’s L33t password cracking rules [Kor].
The semantic segmentation and tagging tool transforms an input string into the list of
segments and their (POS) tags. The tool uses POS tags from CLAWS7 tagset [UCR16].
For example, for a string “applerun” the tool would return segments (apple)(run) and
tags (NN1)(VV0) indicating a singular noun and a base form of a verb. Some segments
may be returned untagged, such as random sequences of letters, numbers and special
characters.
Next, I transform each segment into a different segment in the same semantic category,
to preserve privacy for the user. The goal was to achieve consistent but irreversible
transformation of segments. For example, if a user had two passwords “john352@” and
“john222”, the semantic segmentation and tagging would result in POS tags indicat-
ing (proper-name)(3-digit-number)(special-char) and (proper-name)(3-digit-number). I
then wanted to transform the proper name “john” into another proper name consistently,
so that the resulting two passwords continue to have one common segment. I further
wanted to transform the 3-digit numbers 352 and 222 into different 3-digit numbers and
the special character “@” into a different special character. For example, “john352@”
and “john222” could be transformed into “bob475!” and “bob681,” respectively.
I treat POS-tagged and untagged segments differently for the transformation. I
achieve the consistent and irreversible mapping for POS-tagged segments by employ-
ing a keyed one-way hash function, with a random per-participant key, and a dictionary
of words for each POS tag. I used the same dictionary and Python pickle files as those
used in [VCT14b]. For each participant, I generate a random key from the range [2,
2^32]. This key is appended to each segment and the resultant string is hashed using a
one-way hash algorithm – SHA512. I append the key to enlarge the space of possible
inputs to the hash function. This is especially important for segments that may have few
unique inputs, such as locative nouns. As the next step, I extract only the digits from the
hash function’s output and calculate the modulo of the resulting number and the size of
the dictionary for the given POS tag. I use this result as an index into the dictionary to
find the word, which will replace the original segment. Consistency is achieved because
same segments result in the same input to the hash. Privacy is achieved because of my
use of a per-participant, random key. This key remains in memory during the partici-
pant’s engagement in the study, and is deleted when I close the application. Because
I cannot reproduce the input to the one-way hash function without this key, neither I, nor
anyone else, can reverse the mapping.
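The mapping for POS-tagged segments can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the toy dictionary, the use of the hexadecimal digest as the hash output, and the way the key is drawn are all hypothetical.

```python
import hashlib
import secrets


def new_participant_key() -> int:
    """Draw a random per-participant key from the range [2, 2**32]."""
    return 2 + secrets.randbelow(2**32 - 1)


def transform_segment(segment: str, key: int, dictionary: list) -> str:
    """Map a POS-tagged segment to a word from the dictionary of the same tag."""
    # Append the key to the segment and hash the result with SHA-512.
    digest = hashlib.sha512(f"{segment}{key}".encode("utf-8")).hexdigest()
    # Keep only the decimal digits of the (assumed hexadecimal) hash output.
    digits = "".join(ch for ch in digest if ch.isdigit())
    number = int(digits) if digits else 0
    # The remainder modulo the dictionary size indexes the replacement word.
    return dictionary[number % len(dictionary)]


# Hypothetical toy dictionary for the proper-name tag.
PROPER_NAMES = ["bob", "alice", "carol", "dave", "erin"]

key = new_participant_key()
first = transform_segment("john", key, PROPER_NAMES)
second = transform_segment("john", key, PROPER_NAMES)
```

With the same key, "john" always maps to the same replacement, which gives consistency; once the key is discarded, the input to the hash can no longer be reconstructed.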
For some POS tags, my dictionary has fewer than 100 words. I aggregate such tags
with the tags in their parent category. For example, words belonging to NNL1 and
NNL2 (locative nouns) were added to the tag NN (common noun). If a tag has fewer
than 100 words even after this combination it is grouped into the “OTHER” category. In
my participant data, I did not have any segment tagged with “OTHER”.
Untagged segments mainly include random alphanumeric or special characters, but
may also include words in a foreign language or misspelled words. For such segments,
I generate random sequences of characters in the same category as the original ones
(alphabetic, numeric or special), and achieve consistency by storing this mapping in
memory for the duration of the participant’s engagement with the study. Irreversibility
is guaranteed by the randomness of the mapping, and because I delete this mapping
when the participant exits the study.
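A minimal sketch of the untagged-segment handling is given below. It assumes that the replacement is drawn character by character from the class of the original character; the class and function names are hypothetical, and the sketch does not guarantee that the replacement differs from the original segment.

```python
import secrets
import string


def random_like(segment: str) -> str:
    """Return random characters, each drawn from the class of the original one."""
    out = []
    for ch in segment:
        if ch.isdigit():
            pool = string.digits
        elif ch.isalpha():
            pool = string.ascii_lowercase
        else:
            pool = string.punctuation
        out.append(secrets.choice(pool))
    return "".join(out)


class UntaggedMapper:
    """In-memory mapping that lives only for one participant's session."""

    def __init__(self):
        self._mapping = {}

    def transform(self, segment: str) -> str:
        # Reuse the stored replacement so that repeated segments stay consistent.
        if segment not in self._mapping:
            self._mapping[segment] = random_like(segment)
        return self._mapping[segment]

    def forget(self) -> None:
        # Deleting the mapping at the end of the session makes it irreversible.
        self._mapping.clear()
```

Calling `forget()` when the participant exits corresponds to deleting the mapping described above.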
Finally, I store the semantically transformed segments, their POS tags, capitalization
and mangling information for all segments, authentication status, and the length and the
strength of the original password.
3.4 Results
I now discuss limitations of my study, and present findings about how users create pass-
words, how they reuse them and the causes of such behaviors.
3.4.1 Statistical Tests
To assess statistical significance across different approaches, I performed an
omnibus test, $\chi^2$ (Pearson’s Chi-squared test with Yates’ continuity correction), on
categorical data, with a significance level of 0.05. For quantitative data, I used the
Kruskal-Wallis (KW) test, which does not assume normality, as the omnibus test. If the KW
test showed statistical significance, I used the Mann-Whitney U (MWU) test for pairwise
comparisons.
3.4.2 Limitations and External Validity
My experimental methodology required substantial one-on-one interaction with
participants, who were interviewed for the survey questions, and the study had no external
funding. I therefore could not pay participants much: each spent 30–45 minutes in the
study and received $10. This constraint is why 48 of 50 participants were local students
(27, or 54%, were male and 23, or 46%, were female).
While the statistically significant findings here build a new understanding of how my
select population approaches the problem of maintaining several passwords across different
classes of online services, further research is needed to test whether these effects
generalize to larger populations.
Another limitation lies in my approach to account identification through e-mails. I
may miss some sites where a user has an account because: (1) a site neither sends welcome
messages nor uses an e-mail address for password resets, (2) a user supplied another
e-mail account to the site, (3) my account identification may miss some patterns that
indicate account existence, and (4) a participant may remove some sensitive sites from
my list.

Account          Frequent   Not frequent   Total
Important        212        180            392
Not important    29         200            229
Total            241        380            621
Table 3.4: Account types in my study

Account          Frequent   Not frequent   Total
Important        86%        71%            79%
Not important    76%        57%            59%
Total            85%        63%            72%
Table 3.5: Login success rates per account
3.4.3 Statistics
In this section, I present general statistics on my study population, their accounts and
login attempts.
Account composition. Participants attempted to log into 621 accounts in my study.
I show their breakdown across important/not important and frequently/infrequently used
categories in Table 3.4. Of these accounts, 392 were marked as important by users, 241 were
marked as frequently used and 212 were in both of these categories. Additionally, 29
sites were marked as frequently used but not important, and 180 were marked as important
but not frequently used.
I provided no guidance to participants about what “important” means, thus they
may have flagged a site as important based on their preference for content, rather than
security considerations. I investigate this by comparing the participant responses to
the IR1 question in my Impact Reasoning survey, which asks a participant how affected
she would be if a stranger could access her account. I compare the responses to IR1,
provided on a Likert 1–5 scale, for sites that a user marked as important versus those that
a user marked as non-important. There is a significant difference (KW, $p = 2.17 \times 10^{-13}$)
between ratings of these two groups, with higher values assigned to important sites.

Figure 3.2: User ratings for PS3–PS6 questions.
Login success. Participants successfully logged into 470 accounts. I show the login
success rate across importance and frequency categories in Table 3.5.
Success rate for important accounts (79%) was higher than that for non-important
accounts (59%); $\chi^2_1 = 26.734$, $p = 2.334 \times 10^{-7}$. Similarly, success rate for frequent
accounts (85%) was higher than that for non-frequent accounts (63%); $\chi^2_1 = 30.993$,
$p = 2.589 \times 10^{-8}$. I also compared success rates of four account categories, which
combine importance and frequency ($\chi^2_3 = 42.599$, $p = 2.995 \times 10^{-9}$). Accounts that were
both important and frequently used had the highest login success (86%) and those that
were neither important nor frequently used had the lowest login success (57%); the
Holm-Bonferroni-corrected Fisher’s Exact Test (HC-FET) produced $p = 5.621 \times 10^{-11}$. Similarly,
accounts that were both important and frequently used had higher login success than
those that were not important but were frequently used (HC-FET, $p = 5.621 \times 10^{-11}$).
However, login success for accounts that were both important and frequently used was
not statistically different from success for accounts that were important but not
frequently used (HC-FET, $p = 0.11722$).
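The $\chi^2_1$ statistics above can be reproduced approximately with a short, dependency-free sketch. The per-cell success counts (182, 128, 22 and 114) are not given in the text; they are an assumption, reconstructed by multiplying the counts in Table 3.4 by the rounded rates in Table 3.5. They sum to 446, which matches the number of successful logins reported in Section 3.4.4. The function name is hypothetical.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Pearson chi-squared with Yates' correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # With one degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# (successes, failures) per group; success counts are reconstructed (assumption).
important = (182 + 128, 392 - (182 + 128))
not_important = (22 + 114, 229 - (22 + 114))
frequent = (182 + 22, 241 - (182 + 22))
not_frequent = (128 + 114, 380 - (128 + 114))

chi2_importance, p_importance = chi2_yates_2x2(*important, *not_important)
chi2_frequency, p_frequency = chi2_yates_2x2(*frequent, *not_frequent)
```

Under these reconstructed counts, the two statistics come out close to the reported 26.734 and 30.993; small deviations are expected because the input rates are rounded.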
Figure 3.3: User ratings for R1–R20 questions.
Number of accounts. Out of 50 participants, 29 reported (question ST4, Statistics
survey) that the GMail account they used in the study was their primary e-mail account,
where “primary” means that they use this account when creating new online accounts.
I also asked participants how many e-mail accounts they had (question ST3, Statistics
survey). All responded that they had 2 or more, and 76% had 3 or more.
I now compare the subjective measure of the number of online accounts (question
ST1, Statistics survey) to my objective estimate, based on the GMail account scans.
I show the distribution of these subjective and objective measures of the number of
accounts per participant in Figure 3.4. Median subjective estimate is 15 accounts, but
median objective estimate is 40 (primary GMail account) and 15 (non-primary GMail
account). I report the median and not the mean, as there are several users with a very
high number of accounts. I conclude that, in my study population, users severely
underestimate the number of accounts they have. Users are likely unaware how often they
create accounts, as creation occurs over a long time period. My finding of 40 median
accounts per user updates the finding from a password study by Florêncio and Herley [DC07]
performed in 2006, which reported 25 accounts per user. This increase is expected as many
more online services became available to users over the past ten years.

Figure 3.4: Number of accounts per participant as estimated by the participant (subjective),
and as measured in my study (objective). (CDF plot; x-axis: number of accounts;
curves: subjective, objective - primary, objective - non-primary.)
Number of passwords. When asked how many distinct passwords they had (question
ST2, Statistics survey), participants estimated between 3 and 30 passwords, with
an average of 6.6 and a median of 5. Because I only asked subjects to log in to 12 different
sites, I do not know the exact distribution of distinct passwords per user. However, I
examine subjective and objective estimates of the level of password sharing in Section
3.4.4.
3.4.4 Reuse
I now examine how often users reuse their passwords. There were 160 unique passwords
in 446 successful logins. Out of these, 72 or 45% were used only at one site, while the
rest were reused.
Subjective estimate of reuse is large. Looking at the users’ responses to my Statistics
survey (ST1 and ST2), 98% of participants stated they have fewer passwords than
accounts. Based on these subjective measures, participants believed that they share a
password among 4.7 accounts on average. However, I found in Section 3.4.3 that participants
underestimate the number of accounts they have, by a factor of 2.6. If the subjective
estimate of the number of passwords were correct, this puts the actual password reuse close
to 10 accounts per password. Both prior studies [DC07, WRBW16] tracked the number
of user accounts by instrumenting their browser to record successful logins. However,
users have many accounts that they create once and then use rarely or never again. My
study has potentially revealed these accounts through the inbox scan, and thus produced
a higher estimate for password reuse.
Objective estimate of reuse is large. Table 3.6 summarizes my findings about
reuse, counting a password as versioned if it shared at least one segment with another
password by the same user. I find that reuse is rampant! 98% of participants reuse
their passwords among accounts, and the remaining 2% have similar passwords between
accounts. Further, 84% of participants reuse a password from an important site at a
non-important site, and an additional 6% have similar passwords between important and
non-important accounts. Also, 98% (100% including similar passwords) of users reuse
their important-site password at another important site, but only 64% (72% including
similar passwords) reuse their non-important-site passwords at another non-important
site. This data indicates that many users create a limited number of passwords and
reuse them without discrimination at both important and non-important sites. Average
number of accounts per password in my study, among successful logins, was 2.9. This
is an underestimate of actual reuse, because participants only logged into 12 accounts
during the study.

Type of reuse          Verbatim sharing   Verbatim or similar
All accounts           98%                100%
Important/Non-imp      84%                90%
Important/Important    98%                100%
Non-imp/Non-imp        64%                72%
Table 3.6: Password reuse: percentage of participants that reuse in a given way.
Users not aware of their reuse patterns. I now compare subjective vs objective
reuse of passwords for each participant. I obtain the subjective estimate of reuse by dividing
the user-reported number of accounts by the user-reported number of passwords. I obtain
the objective reuse of passwords by dividing the number of sites a participant successfully
logged into by the number of unique passwords in all successful logins for
that participant. This comparison may be biased, as the accounts the participant chose
to log into may not be a representative sample of all her accounts. In 34% of cases
subjective estimate of reuse is lower than the objective estimate, indicating that users
are not aware how often they reuse passwords.
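The two reuse ratios can be written down directly. The sketch below is illustrative only; the function names and the sample participant are hypothetical.

```python
def subjective_reuse(reported_accounts, reported_passwords):
    """Accounts per password according to the participant's own estimates."""
    return reported_accounts / reported_passwords


def objective_reuse(successful_logins):
    """Accounts per password among successful logins.

    successful_logins maps a site name to the (transformed) password used there.
    """
    unique_passwords = set(successful_logins.values())
    return len(successful_logins) / len(unique_passwords)


# Hypothetical participant: reports 15 accounts and 5 passwords,
# and logs into four sites using two distinct passwords.
logins = {"site-a": "pw1", "site-b": "pw1", "site-c": "pw2", "site-d": "pw1"}
subjective = subjective_reuse(15, 5)
objective = objective_reuse(logins)
underestimates_reuse = subjective < objective
```

A participant counts toward the 34% of cases when `underestimates_reuse` is true, that is, when the self-reported ratio falls below the measured one.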
I then examine responses to PS4 “I use different passwords for different accounts
that I have”. Figure 3.2 shows that users were quite divided on this question. I use
Spearman’s rank correlation to measure the correlation between a user’s response
and the subjective and the objective estimates of reuse. There is a statistically significant
negative correlation between the response to PS4 and the subjective estimate of reuse
($r = -0.4091$, $p = 0.0032$). Thus users who self-report stronger intentions to use different
passwords also self-report a lower number of accounts per password.
On the other hand, I did not find a significant correlation between a user’s response to PS4
and the objective estimate of reuse ($r = -0.0046$, $p = 0.9749$). This finding is contrary to
the finding by Wash et al. [WRBW16], who reported a significant correlation between
a user’s response to PS4 and the objective estimate of reuse. I attribute this difference to
differences in the accounts accessed in the two studies. The study by Wash et al. monitored a
user’s usual login patterns, which mostly included sites that this user regularly accesses.
Users are aware of accounts they have on these sites and can correctly estimate their
password reuse on this subset. On the other hand, my study asked each user to log into
a subset of all sites at which they have an account. As many of these sites are rarely
accessed, the user could not correctly estimate their password reuse.
Users reuse both strong and weak passwords. Similar to Wash et al.
[WRBW16], I measure the correlation of password strength versus the number of accounts
where this password is reused verbatim. I find a significant negative correlation between
these measures (Spearman’s rank correlation, $r = -0.2$, $p = 0.01089$). This contradicts
the findings in [WRBW16], which reported a significant positive correlation between
password entropy and reuse ($r = 0.063$, $p = 0.007$). The difference in findings may
result from different strength measures – I use the statistical guessing measure from the
zxcvbn strength meter while Wash et al. use a weaker measure of password entropy. However,
when I repeat my test using entropy I find no significant correlation ($r = -0.16$,
$p = 0.08$).
Measuring correlation between password strength and reuse may not be the best
way to understand reuse patterns per user, as a strong password for one user may be
weak for another user. However, I cannot measure correlation per user as I do not have
enough samples. Instead, I investigate user’s strategy of reuse if her passwords were
grouped into a stronger and a weaker set. I form these two sets by first identifying
the strongest and the weakest password per user and clustering each reused password
with the closer of these two points. If the maximum and the minimum strength differ
by less than an order of magnitude, I assume that there is only one set of passwords for
the given user. Figure 3.5 shows reuse patterns in my user set, showing minimum and
maximum per user as error-bars and reused passwords as red dots on the bars. The y
axis shows password strength on the log scale. Out of 49 users that reused passwords
verbatim, four had passwords with roughly the same strength – this is the area marked
“same” in the Figure. Additionally three users reused their stronger passwords (area
marked as “stronger”), 17 reused their weaker passwords (area marked as “weaker”)
and 25 reused both stronger and weaker passwords (area marked as “both”). I conclude
that strong and weak passwords are reused comparably often.

Figure 3.5: Password reuse patterns per user: 4 users reuse passwords that are of similar
strength, 3 reuse only stronger passwords, 17 only weaker passwords and 25 reuse both
stronger and weaker passwords. (Plot; x-axis: user ID; y-axis: log(strength); areas:
same, stronger, weaker, both.)
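The grouping into “same”, “stronger”, “weaker” and “both” can be sketched as below. Two details are assumptions not stated in the text: closeness is measured on the log scale, and a password exactly halfway between the extremes is assigned to the weaker set. The function name is hypothetical.

```python
import math


def reuse_pattern(strengths, reused):
    """Classify one user's verbatim reuse as 'same', 'stronger', 'weaker' or 'both'.

    strengths: estimated guess counts for all of the user's passwords.
    reused:    guess counts of the passwords that the user reused verbatim.
    Returns None when the user reused no password.
    """
    if not reused:
        return None
    weakest = math.log10(min(strengths))
    strongest = math.log10(max(strengths))
    # Less than an order of magnitude between the extremes: a single set.
    if strongest - weakest < 1:
        return "same"
    labels = set()
    for guesses in reused:
        x = math.log10(guesses)
        # Assign the password to the closer extreme; ties go to the weaker set.
        labels.add("stronger" if strongest - x < x - weakest else "weaker")
    return "both" if len(labels) == 2 else labels.pop()
```

For example, a user whose passwords span $10^3$ to $10^{12}$ guesses and who reuses only the weakest one would fall into the “weaker” area.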
Simple password versioning. I compare pairs of passwords by the same participant
to detect password versioning – slight changes in the password structure that may be easily
guessed by an attacker. I say that two passwords are similar if they have at least one
common segment and at least one different segment. 34 out of my 50 participants have
at least one pair of similar passwords. Overall, I found 61 similar pairs. I then examined
the changes between passwords and detected eight change patterns, shown in Table
3.7. 62% of passwords are versioned in a very simple manner, by changing or adding
a number, a special character, one dictionary word, or by introducing capitalization or
mangling. If an attacker obtains one password from a pair he can guess the other one
with a very small number of tries. 38% of passwords experience more complex transformations
that combine 2–3 simple techniques and may change or add two dictionary
words. The attacker will need more tries to explore the space of these transformations,
yet far fewer tries than if he were to brute-force the more complex password in the
pair.
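The similarity criterion (at least one common segment and at least one different segment) can be expressed compactly. In this sketch each password is represented as a tuple of its segments; the function names are hypothetical, and comparing segments as sets, ignoring order and repetition, is a simplifying assumption.

```python
from itertools import combinations


def are_similar(segments_a, segments_b):
    """True if the passwords share a segment and also differ in a segment."""
    a, b = set(segments_a), set(segments_b)
    return bool(a & b) and a != b


def similar_pairs(passwords):
    """Return all similar pairs among one participant's segmented passwords."""
    return [(p, q) for p, q in combinations(passwords, 2) if are_similar(p, q)]


# The example passwords from Section 3.3, already segmented.
user_passwords = [("john", "352", "@"), ("john", "222"), ("xkqz", "9")]
pairs = similar_pairs(user_passwords)
```

Here only the two “john” passwords form a similar pair; identical passwords and passwords with no common segment are excluded.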
Password managers do not help. I asked users if they allow browsers or password
managers to save their passwords, in question PS2 in my Password Strategy survey.
60% of users said they allow this always, 24% said they do not allow it and 16% allow it
sometimes. To examine whether a (browser-based or stand-alone) password manager helps
users make better password choices, I compare the password strength and reuse between
always-use and never-use groups. I did not find a significant difference in strength
(HC-MWU, $p = 0.128$) or reuse ($p = 0.201$).
3.4.5 Why Is Reuse Prevalent?
When I detected reuse of the same password verbatim, I asked the participant about its
cause. All 49 participants who reused a password verbatim said they did it for
memorability reasons. In addition, five (10%) said they share passwords only among
accounts they do not care about. I investigated this claim by examining important and
non-important site passwords for these users. In all five cases these users shared a
password between at least two important sites, and they also shared a password between an
important and a non-important site. This is another instance where a user’s subjective
perception of their password strategy does not match their actual behavior.

Type of change                            Num. of pass.   Example
Simple
  Change/add numbers                      15              qklt → qklt18
  Change/add spec. chars                  4               romine6719 → romine6719]
  Change/add alph. chars                  2               jdofabergs36859 → cqkfabergs36859
  Capitalize first/last char              4               coumadinjamkaran → Coumadinjamkaran
  Switch digit/spec. positions            1               vwuondfk.3 → vwuondfk3.
  Change/add one word                     12              2735770770 → benigno 2735770770
  Total                                   38
Complex
  Combination of 2–3 techniques above     20              tranquillizersdenham → tranquillizersdenham9?
  Change/add two words
  and opt. num/spec. char                 3               ddn1ddn → ddn1ddn auditionsboorstin372
  Total                                   23
Table 3.7: Password reuse: percentage of participants that reuse in a given way.
I further asked participants if they were concerned about password-reuse attacks.
Some users were ill-informed: 10% did not know about password-reuse attacks, 8%
knew but thought that strong passwords are immune, which is incorrect. Other users
underestimated the risk of attacks (15% believed it unlikely) or made a conscious deci-
sion to reuse because memorability was more important to them than security (44%).
Finally, I asked users that had similar passwords why they changed their password.
62% did so due to policy requirement and 38% did it out of free will. Thus users
are conscious that verbatim reuse is bad, and they attempt to find alternatives through
password versioning. Yet their versioning is simple and does not help with password-
reuse attacks.
3.4.6 Password Strength
In this section I present my findings about password strength in the user population.
Weak passwords. Figure 3.6 shows the distribution of password strength for suc-
cessful logins to important and non-important sites, as labeled by participants, and
across all sites. I obtain the strength estimate from zxcvbn as the expected number
of guesses before success.
Florêncio et al. [FHVO16a, FHVO14a] suggested that $10^6$ and $10^{14}$ guesses are
reasonable coarse thresholds that a password should withstand to resist online and offline
attacks, respectively. Unfortunately, 27% of important-site passwords and 48% of
non-important-site passwords were vulnerable to online attacks, and 57% of important-site
passwords and 88% of non-important-site passwords were vulnerable to offline attacks. These
findings indicate a disconnect between a user’s desire to make a password strong (by
making it longer), and the outcome.
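The vulnerability percentages follow from comparing each password’s estimated guess count with the two thresholds. The sketch below assumes a strict comparison (a password is vulnerable if its guess count is below the threshold); the function name and the sample guess counts are hypothetical.

```python
ONLINE_THRESHOLD = 10**6     # guesses a password should withstand online
OFFLINE_THRESHOLD = 10**14   # guesses a password should withstand offline


def fraction_vulnerable(guess_counts, threshold):
    """Share of passwords whose estimated guess count falls below the threshold."""
    if not guess_counts:
        raise ValueError("need at least one password")
    return sum(1 for g in guess_counts if g < threshold) / len(guess_counts)


# Hypothetical guess counts, e.g. as estimated by a strength meter.
sample = [3.2e4, 5.0e7, 8.1e11, 2.4e15]
online_share = fraction_vulnerable(sample, ONLINE_THRESHOLD)
offline_share = fraction_vulnerable(sample, OFFLINE_THRESHOLD)
```

Every password that is vulnerable online is also counted as vulnerable offline, which is why the offline percentages are always at least as large as the online ones.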
Longer is stronger. While long passwords in my dataset are still weak, making a
password longer helps. I confirm that there is a significant relationship between password
length and strength by using the KW test ($p = 1.39 \times 10^{-26}$). The Pearson product-moment
correlation is strongly positive ($r = 0.851$, $p = 3.58 \times 10^{-130}$).
Important sites have longer and stronger passwords. I find that important sites
have passwords that are on average 1–2 characters longer and about 20 times stronger
than those at non-important sites (Kruskal-Wallis: $p = 0.0004$ for strength, $p = 0.0013$
for length).

Figure 3.6: Distribution of password strength for important and non-important sites.
(CDF plot; x-axis: estimated number of guesses, log scale; curves: important,
not-important, all; vertical markers: online attack, offline attack.)
3.4.7 Why Are Passwords Weak?
I test several hypotheses to understand causes of weak passwords.
Hypothesis 1: Users do not understand risks of attacks. A user may believe
that password cracking attacks are rare or that she is not likely to be their target. She
may also have a poor mental model of how these attacks occur and how powerful they
are. In my Risk Perception survey, I measure users’ attitudes toward risk. I adopt these
questions from Creese et al. in [CHJPW13]. Participants’ ratings are shown in Figure
3.3.
Creese et al. [CHJPW13] found that answers to questions R4, R9, R16, R18,
R19 and R20 were correlated with password strength. Specifically, the magnitude of
the difference (measured as root-mean-square or RMS) between a participant’s ratings
and the experts’ ratings was positively correlated with smaller keyspace of passwords. I
repeat their approach on my data, and use their experts’ ratings, but find no significant
correlation between participants’ misconceptions about risk and their password strength
(Pearson’s correlation $r = 0.06$, $p = 0.67$ for maximum password strength and
$r = 0.086$, $p = 0.89$ for average). Because Creese et al. used keyspace as the measure of
password strength I repeat calculations with this measure and still find no significant
correlation ($r = -0.1$, $p = 0.46$ for average password strength, $r = 0.05$, $p = 0.69$
for maximum). For completeness, I also attempted to find correlation between each of
R1–R20 responses and average or maximum password strength, but found no significant
correlation. I omit details of these calculations due to space constraints.
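The RMS difference between a participant’s ratings and the experts’ ratings can be computed as in the sketch below. The function name and the sample ratings are hypothetical; the real expert ratings come from [CHJPW13] and are not reproduced here.

```python
import math


def rms_difference(participant, expert):
    """Root-mean-square difference between two equally long rating vectors."""
    if len(participant) != len(expert) or not expert:
        raise ValueError("rating vectors must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(participant, expert)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ratings for the six questions R4, R9, R16, R18, R19 and R20.
participant_ratings = [5, 4, 2, 3, 1, 4]
expert_ratings = [3, 4, 4, 3, 3, 2]
misconception = rms_difference(participant_ratings, expert_ratings)
```

A larger value indicates that the participant’s perception of risk is further from the experts’ assessment.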
Next, I code responses to narrative questions M1–M3 in the Mental Model of
Attacker survey. These questions ask how many password guesses an offline attacker
could make per minute (M1), and how he would craft these guesses (M2). Question M3
tests if a participant has heard of offline guessing attacks against passwords. I round
the answers to question M1 to the nearest power of 10. I code the answers to questions
M2 and M3 as: (d) don’t know, (p) using personal info, (d) using dictionary. Similar to
Wash et al. [WRBW16] I find that users severely underestimate the speed of
password cracking – 76% assume the attack to be of online nature, and the rest assume an
offline attack. Around half of the users do not know how attackers formulate guesses
or believe they use personal information about a user. Further, I found no significant
correlation between user responses to questions M1–M3 and their average or maximum
password strength. Spearman’s rank correlation test had all p-values higher than 0.49
and all $r$ values between $-0.1$ and $0.11$. Thus better informed users did not create
stronger or longer passwords and I cannot confirm hypothesis 1.
Hypothesis 2: Users do not know how to create strong passwords. A user may
understand the need for strong passwords but may not know how to create one. In my
Password Strategy survey I ask users to narrate how they create passwords. I break
PS1 into three sub-questions asking about choices for password segments, the blend
of character classes and if the user considers her password strategy good. I then code
responses to the first two sub-questions. Together these provide information about a
user’s password composition.
PS1-1: How are the password segments chosen: (d) dictionary words, personal
names, significant locations, something in the environment, or (r) random charac-
ters/digits
PS1-2: How many classes of characters are used: (1) one class, (2) two classes,
(3) three or more classes
Finally, I combine codes for PS1-1 and PS1-2 to arrive at a password strategy. For
example, strategy d-2 would mean that a user starts with one or more common words (a
name, a dictionary word) and adds to them one more character class. The most popular
password strategies were d-2 and d-3, favored by 72% of participants. They start with
a dictionary word and add one or two more character classes. The majority of participants
– 93% – used names and words of personal significance in passwords and increased the
strength by adding numbers and symbols, and capitalizing parts of passwords. 80% of
participants said they use two or more character classes, and 30% use three or more.
These results indicate that most users understand how to create strong passwords with
regard to password composition.
Do users with better password composition end up having stronger passwords? I
answer this question by looking for significant differences in password strength between
users that narrated different password strategies. Because of a small number of samples
for some strategies I consider coarser categories. I find that users that intended to use
random segments had no stronger passwords than those that intended to use dictionary
words (Kruskal-Wallis, $p = 0.7725$). Similarly, users that said they use two or more
character classes had no stronger passwords than those that said they use one class
(Kruskal-Wallis, $p = 0.11$). Thus a user’s intended password composition does not
significantly influence their password strength. It is possible, however, that their actual
password composition differs from their intended one. I explore this in hypothesis 4.
Hypothesis 3: Users know how to create strong passwords, but choose not to
do it. A user may have all the right knowledge, but choose to disregard it in favor
of memorability, or because they do not care if their accounts get hacked. I asked the
participants in my Password Strategy survey (question PS2) if they thought their strategy
was good. 28% of participants said they knew their strategy was bad but continued to
follow it, 10% thought it was OK, and 62% thought it was good. The Kruskal-Wallis
test shows a statistically significant difference ($p = 0.0009 < 0.05$) in password strength
between users with good and bad strategies. Passwords of bad-strategy respondents
were indeed weaker than those of good-strategy respondents (Mann-Whitney U test,
$p = 1.8850 \times 10^{-4}$). This confirms hypothesis 3.
I further analyzed narrative responses by bad-strategy participants to question PS2.
One of the participants said: “Its probably not good, but I am not terribly worried about
my passwords being found out.” Another participant said: “I choose whatever is easy
to remember. I think its bad. But, I don’t want to use password resets frequently.”
Also, two participants remarked that their strategy is not good but it is easy to use. Thus
memorability and convenience seem to motivate these users to continue unsafe password
practices.

Figure 3.7: Intended and actual password strategies, shown jointly with password
length and strength. (Two scatter plots, (a) Intended and (b) Actual; x-axis: password
length; y-axis: password strength, log scale; markers: d-1, d-2, d-3, r-1, r-3; reference
lines: online attack, offline attack.)
Hypothesis 4: Users know how to create strong passwords in theory, but strug-
gle to implement this in practice. A user may have all the right knowledge in theory,
but fail to implement it in practice. I detect instances where theory mismatches practice
by comparing a participant’s intended, subjective password strategy (Password Strat-
egy survey) and their actual, objective password strategy (extracted from transformed
passwords).
I derive the objective password strategy by using the segmentation of each
successful-login password. I regard all POS-tagged segments as “meaningful-word seg-
ments” and those that were untagged as “random segments”. I then encode passwords
where length of random segments exceeds that of meaningful-word segments as (r) ran-
dom, and the rest as (d) dictionary. Thus passwords encoded as random may not be fully
random, but they have more random than meaningful content. I also encode the charac-
ter mix of a password using information about capitalization, mangling and presence of
character/digit segments. Thus I arrive at the same tags I used for subjective password
strategy.
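The encoding of the objective strategy can be sketched as follows. Each segment is assumed to be a pair of its text and its POS tag, with `None` marking an untagged segment; the number of character classes is passed in as an input, because the text does not spell out how it is derived from capitalization, mangling and character/digit segments. The function name is hypothetical.

```python
def objective_strategy(segments, n_classes):
    """Encode a password as 'd-N' or 'r-N'.

    segments:  list of (text, pos_tag) pairs; pos_tag is None when untagged.
    n_classes: number of character classes observed in the password.
    """
    meaningful = sum(len(text) for text, tag in segments if tag is not None)
    random_len = sum(len(text) for text, tag in segments if tag is None)
    # Random only if untagged content is strictly longer than tagged content.
    kind = "r" if random_len > meaningful else "d"
    # Three or more classes share the code 3, as in the subjective coding.
    return f"{kind}-{min(n_classes, 3)}"


mostly_word = objective_strategy([("apple", "NN1"), ("42", None)], 2)
mostly_random = objective_strategy([("run", "VV0"), ("xqz9!", None)], 4)
```

A password such as the second example is coded as random even though it contains a dictionary word, because its untagged part is longer than its tagged part.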
I consider the prevalence of different password strategies among unique passwords
from successful logins in my study. The actual password strategies are very evenly
distributed, with the most popular choice being d-2 (32% of passwords), and the least
popular choice being d-3 (8% of passwords). Overall, 60% of passwords use dictionary
words and the remaining 40% use only random characters and/or digits. Further, 27%
of passwords use a single character class, 54% use two character classes and 19% use
three character classes. The Kruskal-Wallis test shows a significant difference in strength
($p = 0.0027$) among the d classes: between d-1 and d-2 (MWU test, $p = 0.0034$) and
between d-1 and d-3 ($p = 0.0017$), but not between d-2 and d-3 ($p = 0.7251$).
Comparing strength between r-1, r-2 and r-3 classes I find a significant difference
(Kruskal-Wallis test, $p = 0.0002$): between r-1 and r-2 classes (MWU test, $p = 0.0053$),
between r-1 and r-3 classes ($p = 2.84 \times 10^{-4}$), and between r-2 and r-3 classes ($p = 0.01$).
When I compare the overall d and r strategies, I find that random passwords were
stronger than those using dictionary words (KW test, $p = 0.0016$).
I illustrate how the actual password composition and length interplay in Figure
3.7. The actual password composition influences somewhat the password strength, with
many robust strategy passwords being above the bar, and many weak-strategy passwords
being below. Yet, length still plays a considerable role. Around half of the random
passwords, and all d-3 passwords, are too short and thus vulnerable to offline attacks.
I conclude that users know how to compose strong passwords in practice, but may
not understand how long their passwords should be.
I next compare subjective and objective password strategies and note how often an
objective strategy is weaker, stronger or equal to subjective strategy. For example, if a
user declared to use r-1 strategy, but ended up using d-1 I would say that her objective
strategy is weaker than her subjective strategy. I find that 28% of passwords use a weaker
strategy than narrated by a user, 48% use a stronger strategy and only 24% match. This
confirms that users do not consistently implement their subjective password strategy, but
also shows that they often use a stronger-than-intended strategy. Thus I cannot support
hypothesis 4.
Hypothesis 5: Password policies lead to weak passwords. It is possible that
users do not independently decide on a given password strategy, in a rational manner,
but instead adopt a site-suggested password policy. To test for this I surveyed, with
the help of Mechanical Turk workers, the password policies of 210 randomly selected
websites where participants successfully logged in during my user study. 27% of these
websites did not have a minimum length requirement, 33% required a minimum of 6
characters, and 28% required a minimum of 8 characters. With regard to password
composition, 76% of sites had no class requirement, 1% required a specific class, 8%
required two classes, 10% required three classes and 5% required four classes. Further,
18% employed some password meter.
How much did site policies influence the strength of the actual passwords users
created? To answer this question I compare the strengths of passwords in the follow-
ing categories: (1) sites that require 8 characters or less versus sites that require more
than 8 characters, (2) sites that have some requirement about character classes versus
those that do not, (3) sites that have a password meter versus those that do not. Sites
that require 8 characters or less have significantly weaker passwords than those with
a stronger requirement (KW test, p = 0.0003). On the other hand, class requirement
or presence of a password meter did not significantly influence password strength (KW
test, p = 0.3694 for class, p = 0.3129 for meter).
3.5 Recommendations
Single factor authentication for important accounts is particularly dangerous when
paired with my finding that even technologically savvy users, like college students,
engage in unsafe password practices. I draw on my findings of how and why users
engage in unsafe password practices to propose interventions that are aligned with user
capabilities.
Help users in understanding their current choices. Users have many online
accounts created over long time periods. They cannot keep track of them, nor rational-
ize their password behaviors. Automated assistants (browsers and password managers)
could help users by long-term tracking of their accounts and passwords, and periodic
summaries and analysis of this data. For example, once monthly a user may see a report
that states “You have 100 online accounts but have only used 15 in the past year. You
have used only 3 different passwords on these 100 accounts.”
Suggest and automate better strategies. When a user is given a summary report,
like above, I should develop mechanisms that suggest meaningful actions, and imple-
ment them automatically. For example, the assistant may ask “Do you want me to delete
these 85 unused accounts or reset your password there to a random string?” and proceed
to do so, if the user answers in the affirmative. The user should also have an option of examining
unused accounts, aided by the assistant, and selecting only some for automated
remediation. The assistant could also analyze passwords for the frequently used sites
and raise alerts when reuse is detected.
Make better use of password assistants. I found that users who use password managers
have neither significantly stronger passwords nor less reuse than users who rely only on
memory. This is a missed opportunity. Users who use password assistants could have
very long, very random and unique passwords for each account. Password assistants
could suggest this policy and aid users by automating its implementation.
Improve password policies. Sites should require longer passwords, and suggest
strategies that help users create those. For example, a site could suggest “use two mean-
ingful words and add two numbers” as a strategy. Sites should also implement more
realistic password meters, and suggest specific improvements when a password fails the
meter. As suggested by Ur et al. [UKK+12], combining better text and visual feedback
can help users create longer and more diverse passwords.
3.6 Conclusions
Users are overburdened today with many online accounts. They are also confused with
weak, inconsistent password policies, simplistic password meters, and with many fac-
tors that influence a password’s strength. Like a large amount of other research on this
topic, I find that many passwords are weak. I find that users have the good will to create
stronger passwords, but they do not understand how long their passwords should be and
how randomness of content interplays with length. I further find that users share pass-
words a lot, due to memorability issues, even when they use a password manager. While
good strategies for password management have been proposed (e.g., [FHVO14b]), users
struggle to implement them in practice. I hope that my recommendations may be better
aligned with user cognitive abilities and thus better adopted in practice.
Chapter 4
Life Experience Passwords (LEPs)
4.1 Introduction
Textual passwords are widely used for user authentication. An ideal password should
be easy for a user to remember, but difficult for others to guess. These two requirements
are at odds. People remember by association [Mas], relating their passwords to some
personally salient facts – but this leads to common and predictable patterns in passwords
that make them insecure against automated guessing [VCT14b, Bon12b]. Users also
tend to reuse their passwords to achieve memorability. But this lowers security, because
passwords stolen from one server can be used to gain access to another server.
Many alternatives to textual passwords have been proposed, such as graphical pass-
words, cognitive authentication, one-time passwords, hardware tokens, phone-aided
schemes and biometrics. However, research has shown that none of these can offer
convenience, simplicity, and user familiarity [BHvS12, BHvS15] comparable to that of
textual passwords. Thus, I focus on improving text-based authentication.
My insight is that it is highly unnatural to humans to create new, complex memories
(passwords) and recall them in minute detail (e.g., recall capitalization and placement of
special characters) after a long time period, and without any hints. Humans remember by
association, relating new facts to existing memories [Mas]. They also recall by recon-
structing facts, sometimes imprecisely, from relevant data stored in the brain [Mas].
Thus, a user may recall that they used a family member’s name plus their birth year for
a password, but forget which family member they chose, whether they used a first or last
name, how it was capitalized, etc.

Security question (creation and authentication)
  Q: Who was your favorite high-school teacher?  A: Miss Jackson
  Q: Who was your best friend in high school?  A: Noah Smith
  Q: What was your favorite subject in high school?  A: Math

LEP (creation)
  Q: How many memorable people were there?  A: 3
  Q: For each memorable person, provide their first and last name and why they are
     memorable to you?  A: Noah Smith was my best friend. Miss Cole was my gym teacher.
     I always fought with Mandy
  Q: Is there anything else memorable about high-school?  A: yes
  Q: In 1-2 sentences describe what else is memorable  A: I broke my leg in gym and
     wore cast for 6 weeks

LEP (authentication)
  Q: What was the first and last name of your best friend?  A: Noah Smith
  Q: What was the last name of your gym teacher?  A: Cole
  Q: What was the first name of the person you always fought?  A: Mandy
  Q: What happened in gym?  A: broke leg
  Q: What did you wear for 6 weeks?  A: cast

Dimension        Security questions    LEPs
Applicability    General               Customized for this user
Fact depth       Shallow               Deep
Fact count       1                     4-5

Figure 4.1: Security questions versus LEP example for “high school” topic.
I propose a new authentication method – life-experience passwords (LEPs) – which
extracts authentication secrets out of a user’s existing memories, and uses prompts and
imprecise matching at the authentication stage to further improve recall. An LEP con-
sists of several facts about a user-chosen life experience, such as a trip, a graduation, a
wedding, a place, etc. At password creation, the system prompts the user for the expe-
rience’s title, and for facts related to this experience such as names of people and loca-
tions, special objects and activities, dates, etc. These facts are transformed by the sys-
tem into questions (stored in clear) and answers (stored hashed and salted). At authen-
tication, the system prompts the user with corresponding questions, and matches her
answers with those stored by the system, allowing for imprecise matches due to extraneous
words, capitalization, punctuation and reordering. LEPs could be used for primary
or secondary authentication in cases when high recall and high security are desired.
I developed two LEP designs and evaluated their recall and security in human user
studies approved by my institutional IRB. I found that: 1) LEPs are from
$10^{9}$ up to $10^{14}$ times stronger than an ideal, randomized, 8-character password; 2) LEPs are
2–3 times more memorable than passwords, having 73% recall after a week and 54% recall
after 3–6 months; and 3) LEPs are reused half as often as passwords.
LEPs resemble security questions, in that both use personal experiences for authen-
tication. But LEPs use more facts than security questions, and these are deeper, more
specific facts. This makes LEPs hard to guess by friends or attackers who use social net-
works or public sources. When compared with security questions, 1) LEPs are 24–35 times
harder to guess by friends (only 0.7% of LEPs were guessed in my study); and 2) LEPs
contain 2.4–3.2 times fewer fake answers (11.5% and 15.7% versus 37%).
There are two downsides to LEPs: 1) the user burden for creating and authenticating
them is 3–6 times higher than when using passwords, and 2) they may contain sensitive
information. While this may not be acceptable to every user and every purpose, 93.7%
of users in my studies said they would use LEPs for high-security content (e.g., banking).
I further found that only 3% of LEPs contained generally sensitive information, which
could be reduced with better user prompts.
4.2 Related Work
There is much research on non-textual alternatives for passwords, such as graphical
passwords [JMM+99, mpa, DMR04a, MGua, MGub, Inc, Bra, BSR+12], videos [YHIN08,
DBvDJ11], and biometrics [bio]. Due to space limitations, I only survey research that is
directly related to textual passwords.
Improving Password Strength. A few recent publications [CDP12, VCT14b,
KSC+] proposed improvements in password design using Markov models, semantic
patterns of passwords, and user feedback. For example, Telepathwords [KSC+] helps
users create strong passwords by comparing user input with popular substrings. When
such a substring is detected, Telepathwords provides actionable, real-time feedback to
steer a user towards a different choice. In [HOK+15], the system randomly generates strong
passwords and then allows a user to replace a few characters to make the password
more memorable. All these techniques address password strength, but do not improve
memorability or diversity, and all require users to create new memories of complex
strings. My work on LEPs differs fundamentally from these approaches because I seek
to exploit existing memories and thus increase strength, memorability and diversity of
textual passwords.
Security Questions. Security questions [Gui06] are often used for secondary
authentication, e.g., when a user loses her password, or to supplement password-based
authentication for high-security servers (e.g., bank). A user is offered a choice of a small
number of fixed questions, such as mother’s maiden name, pet names, favorite teacher
names, best friend names, etc. While both security questions and LEPs use personal
knowledge for authentication, there are significant dierences. I discuss them here and
summarize them in Figure 4.1, which also illustrates sample security questions and a
LEP on the high school topic.
Applicability. Organizations usually offer a very limited choice of security questions.
There may be users to whom no question applies. For example, a user may not have
a favorite high school teacher or a best friend, or may have multiple teachers/friends
that she likes. When faced with such questions users select answers that they do not
recall at authentication time. Schechter et al. [SBE09] and Bonneau et al. [BBC+15]
measured that 20–40% of security questions cannot be recalled by users. Conversely,
during LEP creation users can choose, with very little constraint, the topic they want to
talk about and the facts about that topic, which are memorable to them. This leads to
more personalized facts and thus higher recall.
Depth of facts. Security questions ask for shallow facts (e.g., pet’s name, best
friend’s name), which are generally applicable to many users. Such facts can be mined
from public sources [GJ05, SBE09], or guessed using statistical attacks [BBC+15]. Easy
guessability leads users to provide fake answers to security questions, which leads to low
recall. On the other hand, LEPs ask for deep facts – memorable people, places, activities
or objects in connection with a user-chosen event. Answers to such questions are not
easily found on social networks, or guessed by family and friends, which removes the
need for users to lie. In my studies, only 0.7% of LEPs were guessed by friends (com-
pared to 17–25% of security questions [SBE09]) and only 11.5–15.7% of answers were
fake (compared to 37% for security questions [BBC+15]).
Number of facts. Security questions contain only one fact, which may be easily
guessed or obtained from public sources. LEPs contain a larger number of facts, and a
user must recall most or all of them for authentication. Thus the barrier for a successful
guessing attack is higher.
Another approach to security questions is to let users choose the questions them-
selves. This allows users to freely choose facts that are relevant to them, but decreases
security [JA09, SBE09]. While LEPs also allow users to choose which facts to provide,
my fact elicitation guides them toward secure, memorable and stable facts. This allows
LEPs to outperform user-chosen security questions.
Cognitive Passwords. Similar to LEPs, cognitive passwords are based on personal
facts, interests, and opinions that are thought to be easily recalled by a user. Article [CP]
provides an overview, definitions, and some examples of cognitive passwords. Das et
al. [DHH13] and Nosseir et al. [NCD05] explore autobiographical authentication that
uses facts about past events, which are captured by smartphones or calendars, without
any user input. While such information may be memorable in short intervals after it is
collected, humans do not remember ordinary daily events for long periods nor with great
consistency [BRS87]. On the other hand, LEPs require more user effort during creation,
but elicit more salient facts [Nie03, BRS87], which is essential for good recall.
Narrative Authentication. Somayaji et al. [SMB13] propose use of narratives for
user authentication, but do not evaluate them. Their narratives require users to asso-
ciate imaginary objects with past memories (e.g., contents of a drawer from a childhood
bedroom), and may also be fully fictional. Because these narratives lack personal significance
to the user, I expect they would be less memorable than LEPs.
4.3 LEP Design
This section describes the design and the implementation of life-experience passwords
(LEPs). I discuss my choices for LEP topics and facts in Section 4.3.1, attacker models
and strength calculation in Section 4.3.2, the LEP creation process in Section 4.3.3, LEP-based
authentication in Section 4.3.4 and LEP uses in Section 4.3.5.
Figure 4.2 shows the LEP creation process. A user identifies a life-experience she
wants to use for a LEP. She then inputs a title for this experience and recounts interesting
facts, with some guidance from the system. The system mines the facts from the user’s
input, and transforms them into question-and-answer pairs. Questions and the title must
be stored as clear text, because they are shown to users during authentication. To store
the answers, I concatenate either all of them, or several subsets (see Section 4.3.4), add
the salt, and hash the resulting string(s). During authentication, the system shows the
title and the questions, and user answers are hashed and compared to those stored by the
system.
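The store-and-compare flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the studies: the normalization rules (`normalize` and the `EXTRANEOUS` word list), the choice of PBKDF2 as the salted hash, and the way subsets are keyed by question index are all hypothetical. The sketch hashes every subset of M answers out of N, the option the text refers to as "several subsets".

```python
import hashlib
import hmac
import os
import re
from itertools import combinations

# Hypothetical list of extraneous words; the source does not specify one.
EXTRANEOUS = {"a", "an", "the", "my", "i", "in", "at", "of"}
# Assumed work factor for the salted hash.
PBKDF2_ROUNDS = 100_000


def normalize(answer):
    """Lowercase, drop punctuation and extraneous words, and sort the words,
    so that capitalization, punctuation and word order do not matter."""
    words = re.findall(r"\w+", answer.lower())
    return " ".join(sorted(w for w in words if w not in EXTRANEOUS))


def hash_answers(answers, salt):
    """Concatenate normalized answers, add the salt, and hash the result."""
    joined = "|".join(normalize(a) for a in answers)
    return hashlib.pbkdf2_hmac("sha256", joined.encode("utf-8"), salt, PBKDF2_ROUNDS)


def create_lep(answers, m):
    """Store one salted hash per subset of m answers (by question index)."""
    salt = os.urandom(16)
    stored = {
        idx: hash_answers([answers[i] for i in idx], salt)
        for idx in combinations(range(len(answers)), m)
    }
    return salt, stored


def authenticate(user_answers, salt, stored):
    """Succeed if any subset of m user answers matches its stored hash."""
    return any(
        hmac.compare_digest(hash_answers([user_answers[i] for i in idx], salt), digest)
        for idx, digest in stored.items()
    )
```

With five answers and `m = 4`, `create_lep` stores five hashes, and a user who misremembers any single answer still authenticates. Questions and the title would be stored separately in clear text, as described above.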
4.3.1 Topics and Facts
In this Section, I discuss how much guidance should be provided to users during LEP
creation. In general, users need some guidance to remember interesting facts to recount.
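[Figure 4.2 is a flowchart. Creation: user input and title are processed into facts, then
into questions and answers; the title, questions and hash(es) are stored. Authentication:
the title and questions are shown, user answers are hashed and matched against the stored
hash(es); a match leads to authentication success, otherwise to authentication failure.]
Figure 4.2: LEP creation and authentication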
Further, elicitation must be carefully developed to result in such facts, which can be
accurately recalled by users, and cannot be easily guessed by others.
Which experiences can be used for LEPs? Letting a user freely select an experi-
ence to talk about, without any guidance, may not produce secure and memorable input,
as shown by research on self-built security questions [JA09, SBE09]. This motivated me
to build a list of diverse and general topics, to guide password creation (see Table 4.1).
Category    Subcategory
Event       Engagement, wedding, birth, death, accident, graduation, party, trip
Learning    Driving, skiing, snowboarding, swimming, biking, skill/art, language
About       Person, place
Table 4.1: LEP topics
How to elicit useful facts? A useful fact is a fact that is strong, stable and
immutable. A strong fact has many possible answers to the question, which gives it
strength against brute-force attacks (see Section 4.3.2).
A stable fact is consistently recalled by a user. I have learned by exploring several
LEP designs that stability is influenced the most by the fact’s type and my elicitation
method. Subjective facts about feelings and opinions are inconsistently recalled by
users. I thus ask about objective facts, such as names, locations, times, objects and
activities. Further, elicitation specificity makes a large difference. The more specific
questions I pose during elicitation, the more stable facts I get. A user may use multiple
terms for the same person (e.g., “my mom”, “mom”, “mother”, “Jennifer”). Trans-
forming “who” questions into “first and last name” questions reduces the ambiguity and
increases stability of answers. Another source of instability comes from asking for a
singular answer to a question that may have a plural answer (e.g., a user has two best
friends). Asking too specific questions (e.g., “what is the first and last name ...” but the
user only recalls the first name) or questions that do not apply to a given user (e.g., a
pet’s name when the user does not own a pet) also leads to unstable facts. I believe that
stability issues, more than lying, may be responsible for many authentication failures
found in the past studies of security questions [SBE09, BBC+15]. I have refined my
elicitation process to contain very specific prompts, which depend on the user’s prior
input. This leads to stable facts.
An immutable fact does not change over time. For this reason I do not ask about
preferences and opinions (e.g., “What is your favorite band”), which lead to mutable
facts.
During LEP creation, I mine facts about people, locations, time, objects and activ-
ities. These facts are objective, and thus immutable. People and location facts have a
high strength (see Section 4.3.2), while time, objects and activities have a lower but
still substantial strength. Further, I have designed my elicitation process to produce very
specific questions, and thus stable facts.
LEP questions and answers contain information about some past event, which may
pose a privacy risk to a user if questions are observed by others, or if answers are guessed
or cracked. I advise users to avoid sensitive or incriminating facts during LEP creation.
My evaluation finds that only 3% of LEPs in my study had sensitive information, which
I plan to address with better LEP creation prompts.
                                                Statistical strength                       Brute-force
Category  Description                           Lists  Min. size  Max. size  Unique items  strength
FN        first name (e.g., John)               384    3          38,717     150,695       285,537
LN        last name (e.g., Doe)                 80     100        151,671    223,096       6,209,229
FL        first and last name (e.g., John Doe)  combinations of FN and LN                  563,335,972,290
PL        place (e.g., UCLA)                    48     10         18,467     36,864        1,398,314
CI        city (e.g., Seattle)                  8      85         870        2,230         754,450
OBJ       object (e.g., watch)                  30     6          19,681     22,210        139,049
ACT       activity (e.g., kayaking)             4      14         276        385           11,539
DT        date (e.g., 2/28/1972)                n/a                                        18,250
YR        year (e.g., 2001)                     n/a                                        50
RL        relationship (e.g., mom)              n/a                                        49
HU        approx. 100 choices (e.g., Toyota)    n/a                                        100
TN        approx. 10 choices (e.g., yellow)     n/a                                        10
Table 4.2: Fact categories and their statistical and brute-force strengths (see Table 4.3
for sources)
4.3.2 Strength
This section discusses my attacker models and how I calculate strength of LEPs against
these attackers.
Strength of a password can be measured as the number of trials a guessing attacker
has to make until success – this is known as guess number [Bon12b] or a heuristic
measure of password strength [BHvS15]. I thus use guess number as my measure of
strength of a password, and use Dell’Amico’s Monte-Carlo method [DF15] to calculate
the statistical strength. On the other hand, a LEP consists of multiple facts. I calculate
its strength as the product of individual fact strengths: $S_{LEP} = \prod_{i=1}^{k} S_{f_i}$, where $S$ denotes
strength, $k$ is the number of facts in a given LEP, and $f_i$ is the $i$-th fact.
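As a worked illustration of the product formula, the sketch below multiplies per-fact strengths and converts the result to bits. The brute-force values are copied from Table 4.2; the example LEP (one full name, one year, two cities) is a hypothetical combination chosen only for illustration, and treating the two cities as independent facts is an assumption that follows from the formula.

```python
import math

# Brute-force strength per fact category, copied from Table 4.2.
BRUTE_FORCE = {
    "FN": 285_537, "LN": 6_209_229, "FL": 563_335_972_290,
    "PL": 1_398_314, "CI": 754_450, "OBJ": 139_049, "ACT": 11_539,
    "DT": 18_250, "YR": 50, "RL": 49, "HU": 100, "TN": 10,
}


def lep_strength(fact_strengths):
    """S_LEP: product of the individual fact strengths."""
    return math.prod(fact_strengths)


def bits(strength):
    """Express a guess number in bits."""
    return math.log2(strength)


# Hypothetical LEP: full name of a travel companion, travel year, two cities.
categories = ["FL", "YR", "CI", "CI"]
strength = lep_strength(BRUTE_FORCE[c] for c in categories)
print(f"{strength:.3e} guesses, {bits(strength):.1f} bits")
```

The same function applies to statistical strength if each entry is replaced by the fact's rank on a popular list.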
I classify LEP facts into categories, shown in Table 4.2. I assume an intelligent
attacker, who can infer the fact’s category from the user’s authentication prompt, and
guesses answers only within that category. Table 4.3 shows detailed data sources, which
I used to build my popular item lists.
Attacker Models. I consider the following attacker models [BBC+15, BHvS15].
A brute-force attacker compiles a dictionary of all possible answers within a fact
category (e.g., first name), and tries them in any order. A statistical attacker compiles
ranked dictionaries of popular answers within a fact category, and tries them in the
order of popularity. A friend attacker forms guesses using their personal knowledge of
the user, and may mine some guesses from the user’s social network pages or use search
engines. A password-reuse attacker has stolen a user’s password from one server and
attempts to reuse it, in the exact or a slightly modified version, to gain access to another
server.

Subcategory                                               Possible ans.  Sources
Last Name (LN)
  US                                                      88,879         [cenb, cena]
  China                                                   100            [chid]
  India                                                   580            [inda]
  43 European Countries                                   77,000         [eur, BJM10]
  23 Other Countries                                      116,000        [asi], [sou], [nor], [oce]
First Name (FN)
  US                                                      5,494          [cena, ssa]
  China                                                   259            [chib, chia]
  India                                                   1,833          [indc, indd]
  43 European Countries                                   72,408         [FNw, BJM10]
  23 Other Countries                                      1,240          [FNw]
Cities (CI)
  US                                                      298            [USC]
  China                                                   642            [chic]
  India                                                   870            [indb]
  U.K. Cities                                             100            [BJM10]
  Top 85 largest Cities in the world by Population        85             [wor]
Popular Places (PL)
  US and World Tourist Attraction Places (Best Places     568            [tri, nps]
    to visit in US and World)
  Other Places (Schools, Hotels, Hospitals,               452            [res, tope, topj, hon, eng]
    Restaurants, etc.)
Activities/Actions (ACT)
  Top 25 activities in US                                 25             [topo]
  List of Hobbies                                         276            [Hob]
Objects (OBJ)
  Top Hobby, Popular jobs in US, Popular Grad. Gift,      1,162          [topf, topc, topp, topi, topm,
    Top Activities, Best Movies, Best Singers, Best                      topl, Mag, toph, topg, topb,
    TV Series, Popular Sports in US, Popular Sports                      topq, topn, par, topr, topd,
    in World, Best Writer, Best Books, Popular                           topk, topa]
    Wedding flower, Popular Wedding cake, Popular
    color of bride maid dress, Top 50 food in the world
  Google 20,000 words (*after removing stopwords)         19,000         [goo]
  COCA                                                    5,000          [coc]
Table 4.3: Popular lists, their sizes, and sources
I assume that brute-force, statistical and password-reuse attackers are offline
attackers [BBC+15]. They will use automated programs to crack LEPs, just like they
do for passwords today. They can make as many guesses as their dictionaries allow.
A friend attacker is an online attacker [BBC+15], and will attempt to guess passwords
manually. I thus assume that a friend attacker is allowed to make a small number of
guesses, before being locked out by the server.
I denote the strength of a fact against brute-force attacks as its brute-force strength,
and measure it as the number of all possible inputs in the fact category. It is challenging
to count all possible inputs, since some answers may be drawn from sets that are not
fully enumerable. For example, an answer to a “who” question can be a relationship,
a first name, a last name, a first and last name pair, a title like Mr. and a last name, a
nickname, etc. Even within these subsets there are variations. For example, one could
combine relationship and an adjective, e.g., “my favorite aunt” or “my oldest uncle”.
Further, some subsets may not be fully enumerable. For example, there are publicly
available censuses of US names but not of Chinese or Indian names.
I denote the strength of a fact against statistical guessing as its statistical strength,
and measure it as: (1) the rank of the fact on a ranked list of popular facts in its category,
or (2) the brute-force strength if the fact is not found on the popular list. A challenge
lies in creating sufficiently large and comprehensive lists of popular facts, and ranking
them based on their popularity.
I examined many different data sources, seeking to identify: (1) the total number
of possible facts, and (2) the ranked list of popular facts, within each fact category. I
provide a brief explanation of my data sources here and refer the reader to Table 4.3
for more details. My estimates and popular list sizes are shown in Table 4.2.
Brute-force strength calculation. To calculate brute-force strength, I needed the
total number of possible facts for each category. For the “first name” (FN) and “last
name” (LN) categories, I used the estimates from the U.S. Census [pbs, cenb], the U.S.
Social Security Administration [ssa] and popular names available in 67 countries from
Wikipedia. The total number of FN and LN is 285,537 and 6,209,229 respectively. For
the total number of “full name” (FL) facts, I calculate a product of FN and LN counts
for each country, and sum them up, arriving at 563 billion possible inputs. This overestimates
the number of possible full names, since some FN-LN combinations may never occur
in practice. But I could not find a good public source of full names, and was forced to
approximate.
For the “city” (CI) category, I obtained the list of 754,450 locations with population
greater than 5,000 people from DBpedia [dbp]. For the “place” (PL) category, I calcu-
lated the sum of the number of restaurants in the US [npd] (1,232,016), the number of
universities/colleges in US [PLUa] (7,234) and in the world (8,766) [PLUb], the num-
ber of elementary, middle, and high schools in US [PLS] (129,189), and the number of
secondary schools in UK [BJM10] (21,109). Note that this estimate does not include
other popular attractions, such as amusement parks, hotels and monuments, and is thus
an underestimate of the total number of inputs for the PL category. For the “relation-
ship” (RL) category, I built a small list of relationships (49 entries), compiled from a
dictionary. For the “object” (OBJ) and “activity” (ACT) categories, I used the size of
the WordNet [Fel98] dictionary for nouns and verbs in the English language. For the “year”
(YR) category, I assumed that the user will recount experiences that are at most 50
years in the past. The total number of inputs in the “date” (DT) category is calculated
as 365 × 50. Finally, the categories “hundred” (HU) and “ten” (TN) encompass facts,
which have a limited number of possible answers (e.g., color of a bike, model of a car).
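The arithmetic behind several of these brute-force estimates can be checked with a short sketch. The place counts and the 50-year window are the figures quoted above; `full_name_count` only illustrates the per-country sum of products described for the FL category, and the toy input in the usage note is hypothetical, since the per-country name counts are not listed here.

```python
# Counts quoted in the text for the "place" (PL) category.
PLACE_COUNTS = {
    "US restaurants": 1_232_016,
    "US universities/colleges": 7_234,
    "world universities/colleges": 8_766,
    "US elementary, middle, and high schools": 129_189,
    "UK secondary schools": 21_109,
}

# PL brute-force strength: sum of the individual counts.
pl_strength = sum(PLACE_COUNTS.values())

# YR and DT brute-force strengths: experiences at most 50 years in the past.
YEARS = 50
yr_strength = YEARS
dt_strength = 365 * YEARS


def full_name_count(per_country):
    """FL estimate: sum over countries of (first names x last names)."""
    return sum(first * last for first, last in per_country)


print(pl_strength, yr_strength, dt_strength)
```

For example, `full_name_count([(2, 3), (4, 5)])` returns 26 for two toy countries. The PL and DT results agree with the brute-force column of Table 4.2.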
Statistical strength calculation. To calculate statistical strength I needed lists
of popular facts in each category, ranked by popularity. I used online domain-
specific sources to form these lists. I gathered around 434,000 unique popular
list items from more than 530 different online sources. These sources include (1)
Wikipedia/DBpedia [wik04], (2) Freebase [fre], (3) U.S. Government sources: U.S.
Census, U.S. Social Security Administration, Dept. of Education, Dept. of Labor Stat.,
National Center for Education Statistics, (4) other domain-specific online sources:
TripAdvisor for popular travel destinations, Forbes and US News for educational
institutions, IMDB for movie names, etc., and (5) popular English word lists from Google
20K [goo], and 5K nouns, words, and lemmas from the Corpus of Contemporary American
English (COCA) [coc]. I further incorporated popular lists for different categories
from the Bonneau et al. dataset [BJM10]. More details are provided in Table 4.3.
Some lists had items ranked by popularity, while others did not (e.g. the list of
the 100 most popular Chinese names from Wikipedia). For unranked lists, I used the
Bing search engine to calculate the number of Web pages containing each list item. I
automatically built structured queries as a “fact category + the item” (e.g., “first name
Hao”) and mined the number of pages from the search engine’s reply. I then assumed
that the popularity of an item is proportional to the number of Web pages containing it,
and used this to rank the items in the list. While this may not be an accurate reflection
of the popularity of each item, I believe it is a good approximation for relative rankings.
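The ranking step for unranked lists can be sketched as below. The sketch assumes the page counts have already been collected; it does not call any search engine API, and the counts in the usage note are made-up placeholders. The tie-breaking rule (alphabetical) is also an assumption.

```python
def build_query(category, item):
    """Form the structured query "fact category + the item"."""
    return f"{category} {item}"


def rank_by_page_count(page_counts):
    """Map each item to its rank, where rank 1 has the most Web pages.

    Ties are broken alphabetically (an assumption, not from the source).
    """
    ordered = sorted(page_counts, key=lambda item: (-page_counts[item], item))
    return {item: rank for rank, item in enumerate(ordered, start=1)}
```

For instance, with placeholder counts `{"Hao": 1200, "Wei": 5400, "Jun": 1200}`, "Wei" receives rank 1, and "Hao" and "Jun" follow in alphabetical order.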
I have multiple lists of popular facts per category. I calculate the strength of a LEP
fact as its lowest rank on any list. This approach assumes a strong statistical guessing
attacker, which has the best popular list for each input.
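The statistical strength rule (lowest rank on any popular list, or the brute-force strength for facts found on no list) can be sketched as follows. The function name, the case-insensitive matching, and the toy lists in the usage note are assumptions for illustration.

```python
def statistical_strength(fact, popular_lists, brute_force_strength):
    """Return the lowest 1-based rank of the fact on any ranked list.

    Each list is ordered from most to least popular. If the fact appears
    on no list, fall back to the category's brute-force strength.
    """
    key = fact.strip().lower()
    ranks = [
        ranked.index(key) + 1
        for ranked in popular_lists
        if key in ranked
    ]
    return min(ranks) if ranks else brute_force_strength
```

With toy lists `[["james", "john", "robert"], ["wei", "hao", "john"]]`, the fact "John" has ranks 2 and 3, so its statistical strength is 2. A name found on neither list would get the FN brute-force strength of 285,537 from Table 4.2.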
Finally, if my popular lists were too small, I would overestimate the statistical
strength of LEPs, as I would often use the brute-force strength for off-the-list facts.
I show the count of popular lists per category, and their minimum and maximum sizes,
as well as the total number of unique items in Table 4.2. For example, in the FN (first
name) category, I have 384 popular lists, ranging from 3 to 38,717 inputs, and containing
the total number of 150,695 unique names. I further evaluated how many facts collected
in my user studies were covered by my popular lists. I was able to find 75% of FN,
99% of LN, 81% of FL, 63% of CI, 54% of OBJ, 46% of ACT and 34% of PL inputs
on my popular lists. Thus my popular lists seem comprehensive enough for statistical
strength calculation.
4.3.3 Creation
LEP creation requires users to actively provide input, from which the system extracts
useful facts. In my work, I have investigated guided and semi-guided methods for LEP
input. These methods are triggered after a user has chosen the topic they want to talk
about and provided its title. Figure 4.3 illustrates these input methods with one specific
title “Trip to France”.
In the guided method a user is prompted with a series of questions, chosen from a
fixed set. The questions are displayed one at a time and the choice of the subsequent
questions may depend on the user’s answers to the preceding ones. This is illustrated in
Figure 4.3(a). Some questions may be open-ended, e.g., “What else do you remember
about ...”.
In the semi-guided method the user is prompted to input a certain number of facts
in the given category, and to provide a “hint” for each fact that will be used to form the
authentication prompt. This is illustrated in Figure 4.3(b).
I also investigated a freeform method, where a user inputs a paragraph of free nar-
rative, out of which I automatically extracted useful facts. However, I abandoned this
approach early since it had a large overhead for the user, and did not result in many
useful facts.
My input methods guide the user toward useful facts, such as names, locations and
objects, and away from facts that are not useful, such as preferences, opinions
and feelings. The semi-guided method gives the user more freedom to choose facts
that are relevant to her, but this freedom may lead to unstable facts. I evaluate these
aspects in Section 4.5.
(a) Guided
  User input:
    Title: Trip to France
    How many memorable cities did you visit? 2
    List two memorable cities you visited? Paris, Nice
    When did you travel? 2015
    How many people traveled with you? 1
    List the first and last name of the person that traveled with you? Nick Casey
  LEP:
    Trip to France
    List the first and last name of one person that traveled with you? Nick Casey
    Which year did you travel? 2015
    List two cities you visited: Paris, Nice

(b) Semi-guided
  User input:
    Title: Trip to France
    Enter the first and last name of one person related to this trip and a hint:
      Nick Casey, traveled with me
    Enter two locations related to this trip and a hint for each:
      Paris, best art, Nice, wonderful weather
  LEP:
    Trip to France
    List the first and last name of one person that traveled with you? Nick Casey
    List a location related to "best art": Paris
    List a location related to "wonderful weather": Nice

Figure 4.3: LEP input methods
4.3.4 Authentication
During authentication the system shows all the questions to the user, obtains the answers
and compares them against one or several stored hashes.
Let a LEP contain $N$ facts. I require that a user recall $M$ facts for authentication
success, where $M \le N$, and that the strength of the recalled facts be greater than some
target value (I use the 3class8 strength of $10^{15}$ (52.55 bits) in the evaluation). The smaller
the difference between $N$ and $M$, the stronger the authentication criterion. Further, if
$M < N$, the system must store $\binom{N}{M}$ hashes for one LEP. During authentication, the
system produces all possible combinations of $M$ user answers, and hashes them. Any
match between these and the stored hashes leads to authentication success. I have explored
different values for $N$ and $M$, and provide more information about their performance in
Section 4.5.
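The M-of-N scheme described above can be sketched as follows. This is a minimal illustration and not the implementation used in the studies: the function names, the use of unsalted SHA-256, and the choice to bind each answer to its question index are all assumptions made for the example, and the additional check that the recalled facts exceed the target strength is omitted.

```python
import hashlib
from itertools import combinations


def _digest(indexed_answers) -> str:
    # Join "index=answer" pairs so that each hash is bound to specific questions.
    # Unsalted SHA-256 is an illustrative assumption, not a recommendation.
    joined = "|".join(f"{i}={a}" for i, a in indexed_answers)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def enroll(answers, m):
    """Return one hash per M-subset of the N (already normalized) answers."""
    return {_digest(c) for c in combinations(enumerate(answers), m)}


def authenticate(stored, attempt, m):
    """Succeed if any M-subset of the attempted answers matches a stored hash."""
    return any(_digest(c) in stored for c in combinations(enumerate(attempt), m))


answers = ["nick casey", "2015", "paris", "nice", "louvre"]
stored = enroll(answers, 3)          # C(5, 3) = 10 hashes
print(len(stored))
print(authenticate(stored, ["nick casey", "2014", "paris", "nice", "eiffel"], 3))
```

In this sketch the second call succeeds because three of the five answers are correct, which is enough for one stored combination to match.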
Authentication may fail not only because a user forgot her answers, but also because
she recalled them imprecisely. Imprecise recall of LEPs occurs due to a high redun-
dancy of natural language, as explained below. I address some sources of mismatch
through imprecise matching, which includes normalization, keyword extraction and
reorder matching. While imprecise matching will reduce strength of LEPs, my evalua-
tion shows that resulting LEPs still have high security (Section 4.5), and that imprecise
matching significantly improves recall.
Capitalization, reordering and punctuation. A user might respond to the prompt
using different capitalization or punctuation than she did during password creation. I
overcome this by normalizing user answers before storage and authentication, removing
all capitalization and punctuation. A user may also list several parts of the answer in
a different order, e.g., she may reply “Nice, Paris” where the original answer was “Paris,
Nice”. I resolve this through reorder matching: I detect when an answer may consist of
multiple parts, and try all permutations of these parts in the matching process.
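Normalization and reorder matching could look roughly like the sketch below. The helper names, the use of a comma as the part delimiter, and the restriction to ASCII punctuation are assumptions for illustration; the comparison is done on plain strings here, whereas a hash-based deployment would hash each permutation before comparing.

```python
import itertools
import string

# Translation table that deletes ASCII punctuation (an assumed definition).
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(part: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(part.lower().translate(_PUNCT).split())


def split_parts(answer: str) -> list[str]:
    """Split a multi-part answer on commas, then normalize each part."""
    parts = (normalize(p) for p in answer.split(","))
    return [p for p in parts if p]


def reorder_match(stored: str, given: str) -> bool:
    """True if some permutation of the given parts equals the stored parts."""
    target = tuple(split_parts(stored))
    return any(perm == target
               for perm in itertools.permutations(split_parts(given)))


print(reorder_match("Paris, Nice", "nice, PARIS!"))
print(reorder_match("Paris, Nice", "Paris, Lyon"))
```

Trying all permutations mirrors the description in the text; for answers with many parts, comparing sorted parts would give the same result at lower cost.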
Misspelling. A user may misspell a reply during password creation or authentica-
tion. I leave handling of misspelling for future work.
Synonyms. A user may reply to a question with a near-synonym to the extracted
term, such as responding with “lake” instead of “pond”; or with a term that is more specific
(hyponym) or more general (hypernym) than the expected term, such as “poodle”
instead of “dog”. I leave handling of synonyms, hyponyms and hypernyms for future
work.
Extraneous words. A user may provide extraneous words in an answer. For exam-
ple a question “what was red” may lead the user to input “my apple was red” even though
I expect just “apple” as an answer. I address this via keyword extraction. I apply key-
word extraction both during password creation and during authentication, in the same
manner. The extraction method depends on the answer type. From OBJ answers I
extract nouns only, from PL answers I extract nouns and out-of-dictionary words (likely
place names), and from ACT answers I extract verbs only. Other categories do not need
keyword extraction.
4.3.5 Uses of LEPs
LEPs could be used instead of passwords, but they may not be best suited for all authen-
tication tasks, because their creation and authentication are more time-consuming. One
possibility would be to use LEPs instead of passwords for first-time authentication, when
cookies have expired, when the user is accessing an online service from a new machine,
or when the user is logging onto his local machine after a logout. In these rare situations
the added overhead of LEPs may be acceptable to users, at the benefit of higher security.
Another possibility would be to use LEPs for secondary or added authentication,
instead of security questions. I show in Section 4.5 that LEPs surpass security questions
in security, recall and strength against friend guessing. Many high-value services
currently send a code via text message to the user's phone, to reduce the risk of password
cracking. LEPs could be used in lieu of the code when a user does not have access to
her phone (e.g., during international travel).
LEPs can also be used on high-security servers, such as government or bank servers.
A logged in user may be prompted for one or several facts after a period of inactivity to
verify that someone did not gain physical access to her computer.
4.4 User Studies
I evaluated LEPs through a series of user studies. I implemented LEP creation and
authentication as a Web application, so that it can be used remotely. I then ran multiple
user studies over a period of two years, with Amazon MTurk participants [mtu] and
with students at my institution, and used their results to refine and improve both my
user interface and my elicitation process. All studies were approved by my Institutional
Review Board (IRB). All communication with participants was in English, and their
inputs were also required to be in English. Participants were required to be at least 18 years of
age to participate.
Each participant was first shown the Information Sheet explaining the purpose of the
study. Following this, she registered in one of three ways, depending on the specific
study’s design: by entering her MTurk ID, by entering her e-mail address, or by having the
system assign her a random identifier. A participant then input her demographic information (age,
gender, and native language) and proceeded with the study. In this chapter I report on
the two culminating studies, which I used to evaluate LEP performance. I describe my
study design in this section and provide detailed results and their interpretation in the
next section.
Performance Study. This study was run in Fall 2015 and Spring 2016. It was
designed to evaluate strength against brute-force and statistical guessing attacks,
memorability and reuse of LEPs, and to compare them to the same qualities of ordinary, 3class8
[KSC+] passwords. I recruited participants from Amazon MTurk, and asked each to
create ten LEPs and ten ordinary passwords. This scenario is unnatural, since no user
would create so many passwords in such a short time span in real life. However, asking
for ten passwords enabled me to study password reuse.¹ I asked participants to create
passwords (LEPs or 3class8 passwords) for the following ten online sites and displayed
their logos during creation and authentication: Facebook, Google+, Gmail, Outlook,
Bank of America, Chase Bank, Target, WalMart, Wall Street Journal and CNN.
¹ I initially attempted a phased study design, where a participant created one password (LEP or 3class8)
and returned after one week to authenticate. After authentication the participant created another password
for the next cycle. However, I had to abandon this design due to high attrition rates.
Asking users to create passwords for fictional servers and recall them later will nec-
essarily underestimate recall in real life, because user motivation to remember these
passwords is low. I believe that this confounding factor will have a similar effect on
LEP recall as on password recall.
I required the ordinary passwords to follow 3class8 policy – being at least 8 char-
acters long, and containing characters from 3 out of 4 character classes: uppercase and
lowercase letters, digits and special characters. Each LEP was required to specify at
least 5 facts. Users were asked to return for authentication after one week. I allowed
three authentication attempts per LEP or password. I paid $1 for the creation task, and
$2 for the authentication task.
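For concreteness, the 3class8 policy stated above can be checked with a few lines of code. This is a sketch under the assumption that "special characters" means ASCII punctuation; the function name is hypothetical and the check used in the study may have differed.

```python
import string


def meets_3class8(password: str) -> bool:
    """At least 8 characters, drawn from at least 3 of 4 character classes."""
    classes = [
        any(c in string.ascii_uppercase for c in password),
        any(c in string.ascii_lowercase for c in password),
        any(c in string.digits for c in password),
        any(c in string.punctuation for c in password),  # assumed "special" set
    ]
    return len(password) >= 8 and sum(classes) >= 3


print(meets_3class8("Passw0rd"))   # upper, lower, digit, 8 characters
print(meets_3class8("password1"))  # only lower and digit
```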
To minimize password reuse, I did not let participants select a LEP topic from a list.
Instead, I offered a topic for each LEP, which was randomly selected from my topic list.
A participant could reject the offered topics until she found one that she wanted to talk
about.
At registration, each participant was randomly assigned to either the guided or the
semi-guided input method. She then created and authenticated with all 10 LEPs using the
same input method. The guided group was asked between 5 and 15 questions per LEP.
The semi-guided group was asked to specify two people, one location and two objects
for each LEP, and to specify a hint for each fact. After creating all 10 LEPs, each
participant was asked to create 10 passwords, for the same servers.
Participants were reminded via E-mail to return for authentication after one week.
I also invited them to return for authentication after 3–6 months to measure long-term
recall.
Friend Guessing Study. This study was run concurrently with my performance
study, using the same system. I recruited participants from my institution to conduct an
in-lab study, with the goal to measure the strength of LEPs against a friend attacker. In
addition to personal knowledge, I encouraged the guessers to fully utilize information
available from various social network sites and search engines.
Participants were required to enroll in the study with at least one friend. I
advertised the study via class announcements, wall posters, and flyers. Each participant
was paid $10, and took 45–60 minutes to complete the study.
Unlike my previous studies, this study used deception in the Information Sheet
(approved by the IRB), by not informing the participants that they would be guessing each
other’s LEPs. This was necessary to prevent participants from intentionally creating
LEPs that would be either too easy or too hard for a given friend to guess. I designed
my study to mimic real-life password use, where one does not know who may try to
guess one’s password.
After reviewing the Information Sheet, each participant was asked how close they
were with their friend on a scale from 1 to 5 (closest), and how long they had known each
other. Next, they were asked to create three LEPs for three different online accounts:
Gmail, Facebook and Bank of America. I displayed the corresponding logos during creation
and authentication. Each participant was randomly assigned to either the guided or the
semi-guided input method, and created all LEPs using this method.
Next, participants were asked to authenticate with each LEP, and were allowed an
unlimited number of attempts, but were required to make at least three. I incorporated user
authentication in the study to ensure that participants did not make up answers they
could not recall themselves.
After authentication, I debriefed each participant about the deception and explained
my reasons for this. I informed them that their friend would be guessing their password
and vice versa. They were offered a chance to quit the study at this time, and still receive
the full payment. No participants quit.
                                 Sec. 4.5.2 Performance                      Sec. 4.5.3 Friend Guess.  Sec. 4.5.2           Literature
Row  Measure                     Guided                Semi-guid.            Guided      Semi-guid.    Passwords            Security Questions
1    Participants                44                    49                    47          44            93                   [SBE09]
2    Passwords                   440                   490                   141         132           930                  [BBC+15]
3    Brute-force                 161 bit ($10^{48}$)   132 bit ($10^{39}$)                             53 bit ($10^{15}$)   -
4    Statistical                 99 bit ($10^{29}$)    82 bit ($10^{24}$)                              -                    -
     Recall (1 week)
5      all-fact                  31.6%                 45.7%                 -           -             26%                  32.1%–83.9%
6      five-fact                 47.7%                 45.7%                 -           -
7      four-fact                 70.0%                 73.0%                 -           -
8      three-fact                82.1%                 89.2%                 -           -
     Recall (3-6 mo)
9      all-fact                  16.5%                 32.3%                 -           -             9%                   6.4%–79.2%
10     five-fact                 33.9%                 32.3%                 -           -
11     four-fact                 53.0%                 54.0%                 -           -
12     three-fact                66.5%                 73.6%                 -           -
     Friend guessing
13     all-fact                  -                     -                     0.7%        0%            -                    17–25%
14     five-fact                 -                     -                     0.7%        0%
15     four-fact                 -                     -                     0.7%        0%
16     three-fact                -                     -                     1.3%        4.5%
17   Fake info.                  15.7%                 11.5%                 -           -             -                    37%
18   Identical (avg)             3.1%                  2.7%                                            5.7%                 -
19   Similar (avg)               15.4%                 4.6%                                            31.6%
20   Time to create (med)        112.7 s               112.0 s                                         16.8 s               -
21   Time to succ. auth. (med)   51.9 s                37.3 s                                          11.3 s               -
Note: Passwords and Security Questions values listed in rows 5, 9 and 13 apply to the whole row group.
Table 4.4: Participant Statistics and Results of My Studies
Next, each participant attempted to guess her friend’s LEPs, and was allowed an
unlimited number of attempts, but was required to make at least three. Afterwards, I reviewed
successfully guessed answers with participants, asking them about their strategy. I ended
the study with a short survey about usability of LEPs. Lastly, I asked the participants
not to disclose details about the deception to other students on campus, so I could continue
recruitment.
4.5 Results
In this section, I report on the results of my two user studies. I found that: (1) LEPs
are from $10^9$ up to $10^{14}$ times stronger than an ideal, randomized, 8-character password, (2)
LEPs are 2–3× more memorable than passwords, (3) LEPs are reused half as often as
passwords, (4) LEPs are 24–35× harder for friends to guess than security questions, (5)
LEPs contain 2.4–3.2× fewer fake answers than security questions.
4.5.1 Participant Statistics
Table 4.4 shows the breakdown of my participants in the first two rows. I show the count
of participants who completed both password creation and authentication tasks, and
the total number of passwords. I also collected participant demographics (age, gender and
native language) but found that these factors did not have a significant impact on security,
memorability or password reuse. With regard to topics chosen by participants, 55% of
LEPs talk either about learning (26%), people (18%) or trips (11%), while other topics
were less popular.
4.5.2 LEPs Are Memorable and Secure
In this section I report findings from my performance study, described in Section 4.4.
Privacy risk. LEPs have more sensitive information than passwords and, since their
title and questions are displayed in the clear, that may increase privacy risk to a user. It is
difficult to accurately measure sensitivity of LEPs, as it depends on a user’s subjective
assessment of how specific information relates to her sense of privacy. Instead, I calculate
how many LEPs contain information that most people would find sensitive, such as
information about illness, incarceration, love affairs (excluding official boyfriend/girlfriend
and spouse information) and indecent or illegal activities. My result represents a lower
bound on sensitive information contained in my dataset. 29 out of 930 LEPs, or 3%,
contained sensitive information. The majority of these were LEPs about the death topic, and
the sensitive information was divulged in the guided LEPs, because I asked about the
cause of death. Better question design can further reduce this privacy risk.
Security. Table 4.4 shows the total number of LEPs, and their average brute-force
and statistical strength in the 3rd and 4th rows. I also calculated the percentage of LEPs
that have higher strength than a totally random 3class8 password. I denote this
strength as $S_{3c8}$. Almost all LEPs (94–95%) exceed $S_{3c8}$. Statistical strength of LEPs is
also quite high – it is $10^9$ up to $10^{14}$ times higher than $S_{3c8}$.
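The relationship between these figures can be checked with a short calculation. The alphabet size of 95 printable ASCII characters is an assumption on my part (the text gives only the resulting strength), but it reproduces the 52.55-bit value quoted in Section 4.3.4 up to rounding, and the average statistical strengths from Table 4.4 then give the stated ratios:

```latex
\begin{align*}
S_{3c8} &\approx 95^{8} \approx 6.6 \times 10^{15},
  & \log_2 95^{8} &= 8 \log_2 95 \approx 52.56 \text{ bits},\\
\frac{2^{82}}{S_{3c8}} &\approx 2^{82 - 52.56} = 2^{29.44} \approx 10^{8.9},
  & \frac{2^{99}}{S_{3c8}} &\approx 2^{99 - 52.56} = 2^{46.44} \approx 10^{14.0}.
\end{align*}
```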
Short-term recall. I report the percentage of successful authentications, after one
week, in Table 4.4, in rows 5–12 for LEPs and for passwords. For LEPs, in addition
to all-fact recall, I investigated three alternative authentication schemes – five-fact,
four-fact and three-fact recall. In these, the user must successfully recall 5, 4 or 3 facts,
respectively, and the statistical strength of the recalled facts must exceed $S_{3c8}$.
Authentication success after one week is shown in rows 5–8 of Table 4.4. Password
recall in my study was 26% (others have found 45–70% recall [KSC+, SKK+12,
SKD+14], but they asked users to recall only one password after 2 days). My findings are
consistent with psychological literature [AH11, Ebb13], where Ebbinghaus found that
people retain only 25% of new information they learned after six to seven days [Ebb13].
Imprecise matching greatly helps to increase LEP recall. With exact matching, all-
fact authentication would be 19% for guided and 9.6% for semi-guided LEPs. With
imprecise matching, it is 31.6% and 45.7%.
LEP authentication success with all facts is 30–75% higher than that for passwords.
When I require fewer facts for authentication the success rate increases significantly. At
four facts, LEP success is 2.7× higher than the password success rate, and at three facts it
is 3.2× higher. Allowing users to authenticate with fewer than all facts lowers the security
of LEPs. Since I also require that the statistical strength of recalled facts exceed $S_{3c8}$,
the remaining strength of these “shortened LEPs” is sufficiently large to thwart attacks.
In Section 4.5.3 I will investigate how requiring $M < N$ facts for authentication affects
strength against friend attacks.
Security questions have a wide range of recall rates after one month – from 32.1%
for frequent flyer number to 83.9% for city of birth [BBC+15]. LEP recall with four-fact
and three-fact authentication is 70–89.2% and thus resembles recall for memorable
security questions. If a LEP were equivalent to a set containing several security questions,
its best recall rate would be $(83.9\%)^4 \approx 50\%$ for four-fact and $(83.9\%)^3 \approx 59\%$ for
three-fact authentication. The fact that LEP recall exceeds these values shows the power
of user-customized questions and imprecise matching, over general questions and exact
matching.
To understand reasons for failures to recall a LEP, I examine recall rate per fact
category (as given in Table 4.2). Table 4.5 shows the percentage of correctly recalled
facts (in at least one attempt) per fact category in column 2. Relationships and cities
are most accurately recalled, followed by items in the HU category, places, and first and
last names. Overall, all categories except ACT have recall of more than 70% after one
week. While this is quite high, it may be puzzling why a user would fail to recall all
facts correctly, or at least at higher rates. I note some frequent reasons for failed recall
per category in column 3. Many of these could be handled by better NLP techniques,
e.g., using stemming for verbs, trying synonyms during matching, building a database
of common abbreviations (e.g., gf for girlfriend), etc. This would further improve recall,
at some security cost. I leave this direction for future work.
Cat.  Recall  Failure reasons                        Guess
FN    77%     Misspelling, nickname, FL              20%
FL    81%     FN, misspelling                        5%
PL    82%     Misspelling, abbreviations, synonyms   13%
CI    92%     Misspelling, more/less specific ans    53%
OBJ   79%     More/less specific ans                 10%
ACT   51%     Tense mismatch, miss verbs             -
DT    77%     Total miss                             16%
YR    73%     Total miss                             17%
RL    95%     Abbreviation                           48%
HU    82%     Synonym, more/less specific            21%
TN    80%     Synonym                                36%
Table 4.5: Fact recall and guess success per category
Overall, users failed to recall 18.2% of facts in any authentication
attempt. Users provide fake answers to security questions. They may also provide fake
answers to LEPs that they cannot later recall. While I cannot accurately establish
which facts are fake and which are not, I estimate the incidence of lying by looking for
facts that a user failed to recall, and where the failure cannot be attributed to the NLP
reasons I listed in Table 4.5, i.e., it is a total miss. I find that 11.5–15.7% of facts were
not recalled by users, and thus may be fake. This rate is less than half of the fake
answer rate for security questions [BBC+15]. A finer investigation further shows that
about half of the failed recall cases occur because a user used an initial instead of a last
name in LEP creation, but failed to do so in authentication. I could handle this case with
better authentication prompts.
Long-term recall. I invited all participants in my performance study to authenticate
with their LEPs and passwords once again, in May 2015. A total of 54 participants
returned. The time between creation and authentication for these return participants
ranged from 104 to 231 days, with a median of 120 days. Table 4.4 shows the long-term
authentication success in rows 9–12. While both LEP and password recall
declined, the time lapse affected recall of passwords much more than recall of LEPs. Password
recall declined by 66%, while LEP recall declined by 17–47%. I thus conclude
that LEPs are more robust than passwords with regard to long-term recall.
Security questions have a wide range of recall rates after 3–6 months – from 6.4% for
frequent flyer number to 79.2% for city of birth [BBC+15]. LEP recall with four-fact
and three-fact authentication is 53–73.6%, within the range of more memorable security
questions.
Reuse. I also explored strength of LEPs and passwords against a password-reuse
attacker. The results are shown in Table 4.4 in rows 18–19. I first investigated how
many out of 10 passwords were identical, for each given user. A LEP fact is said to be
identical to a fact in another LEP, by the same user, if their answers would match during
authentication (accounting for capitalization, reordering, punctuation and extraneous
words). A LEP $l_1$ is identical to the LEP $l_2$ if all of $l_1$’s facts match the facts of $l_2$. There
were 2.7–3.1% identical LEPs, compared to 5.7% identical passwords.
I next investigated how many of a user’s ten passwords were sufficiently similar
to each other that a password-reuse attacker could easily guess one if he knew the
other. Because LEPs are authenticated based on fact matches, and passwords based
on exact string match, it is hard to devise a similarity measure that applies equally
well to both concepts. To define similarity of two passwords, I borrow from the Linux
Pluggable Authentication Modules (PAM) [PAM] design. I say that two passwords p1
and p2 are similar if more than 1/2 of the items in p2 also appear in p1. For passwords,
items are characters and for LEPs, items are facts. This definition allows me to directly
apply pam_cracklib to detect similar passwords. I say that op1 is similar to op2 if
at least one of the following conditions holds: (1) more than 1/2 of op1’s characters
appear in op2, (2) op1 is a palindrome of op2, (3) op1 is op2 rotated, (4) op1 differs
from op2 only in case. There were 4.6–15.4% similar LEPs, in the semi-guided and guided
categories, respectively, compared to 31.6% similar passwords. Thus passwords were
reused more than twice as often as LEPs, and guided LEPs were reused 3× more often
than semi-guided LEPs.
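A literal reading of the similarity rules above can be sketched as follows. This is a simplified illustration of the four conditions as worded in the text, not a reproduction of pam_cracklib's actual checks; the function names are hypothetical, "palindrome of" is interpreted as "reverse of", and LEP facts are assumed to be normalized strings collected in sets.

```python
def similar_passwords(op1: str, op2: str) -> bool:
    """Apply the four conditions from the text to two ordinary passwords."""
    # (1) more than half of op1's characters also appear in op2
    shared = sum(1 for ch in op1 if ch in op2)
    if op1 and shared > len(op1) / 2:
        return True
    # (2) op1 is op2 reversed
    if op1 == op2[::-1]:
        return True
    # (3) op1 is a rotation of op2
    if len(op1) == len(op2) and op1 in op2 + op2:
        return True
    # (4) op1 differs from op2 only in letter case
    return op1.lower() == op2.lower()


def similar_leps(l1: set, l2: set) -> bool:
    """Two LEPs are similar if more than half of l2's facts appear in l1."""
    return len(l2) > 0 and len(l1 & l2) > len(l2) / 2


print(similar_passwords("Summer2015!", "summer2015!"))
print(similar_passwords("abcd", "wxyz"))
print(similar_leps({"paris", "nice", "2015"}, {"paris", "nice", "rome"}))
```

Under this reading, conditions (2) and (3) are already implied by condition (1) for non-empty passwords, since a reversed or rotated string shares all of its characters with the original; they are kept separate only to mirror the list in the text.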
Time to create and authenticate. I show the median time to create and authenticate
a LEP or a password in Table 4.4, in rows 20–21. LEPs take 6.7× longer to create
and 3.3–4.6× longer to authenticate than passwords. This is expected as they require
a user to both read the questions and to provide input that is approximately five times
longer than a password.
Storage. LEP answers should be concatenated and stored as one or several hashes.
In the case of all-fact authentication, I store only one hash per LEP. If I allow for M-fact
authentication ($M$ = 3, 4 or 5), I would have to create, hash and store $\binom{N}{M}$ combinations
of facts for each LEP, where $N$ is the number of all facts. The authentication process would
then have to match any one of these hashes to succeed. 77% of participants would
need up to 35 hashes, 88% would need up to 70, 92% would need up to 126 hashes,
and the worst-case scenario would require 1,287 hashes. Even in the worst case, the
storage cost would only amount to several kilobytes per user, which is negligible for
today’s server storage. Because one-way hashing is fast, the processing cost should also
be acceptable. I could further limit the storage cost of LEPs by discarding all but the $N$
strongest facts at creation time. For $M \in \{3, 4\}$ and $N = 8$, each LEP would require at
most 70 hashes.
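The hash counts quoted above follow from the binomial coefficient. The sketch below assumes, as an interpretation rather than something the text states, that the per-LEP figure is the largest $\binom{N}{M}$ over the allowed values of $M$; under that assumption the quoted counts correspond to LEPs with 7, 8, 9 and 13 facts.

```python
from math import comb


def hashes_needed(n_facts: int, allowed_m=(3, 4, 5)) -> int:
    """Largest number of M-subsets of n_facts over the allowed M values."""
    return max(comb(n_facts, m) for m in allowed_m if m <= n_facts)


for n in (7, 8, 9, 13):
    print(n, hashes_needed(n))

# Capping N at 8 and allowing only M = 3 or 4 bounds the count at C(8, 4).
print(hashes_needed(8, allowed_m=(3, 4)))
```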
4.5.3 LEPs Are Strong Against Guessing
This section discusses results of my friend guessing study, described in Section 4.4. I
recruited a total of 91 participants, forming 100 different pairs. A few participants came
in groups of 3 or more people. All of the participants completed the study in the same
sitting. The participants were students from my institution from freshmen to graduate
students, with majors in engineering, social science, theater, biology, math, international
relations, music, business, economics, psychology, and linguistics.
Participants had known each other for between 3 months and 6 years. All of them were already
friends on at least one social network (e.g., Facebook or Instagram) and were encouraged
to use social networks and public sources during guessing. For participant pairs
that were not from the US, both participants were from the same country and shared the
same cultural background, which helped them make more educated guesses. Average
closeness rating on the scale 1–5, with 1 being the lowest, was 2.6 for the guided group,
and 3 for the semi-guided group.
Most users were able to fully authenticate with their LEPs in the second stage of my
study. A few authentication failures I observed were due to misspelling, synonyms and
more/less specific answers provided during authentication than during creation.
I show friend guessing success in Table 4.4, rows 13–16, using the same authentication
schemes as in Section 4.3.4. The guess success rate is very low (0–0.7%) for all-fact
authentication, and climbs to 1.3–4.5% for three-fact authentication. Taken together
with recall results, this data suggests that four-fact authentication strikes the
right balance between achieving high authentication success and reasonable strength,
while keeping the friend guessing success low.
Compared to security questions, where friends could guess 17–25% [SBE09], LEPs
are 3.7–35× stronger, assuming four-fact or three-fact authentication.
Guess success rate per fact category, for facts described in Table 4.2, is shown in
column 4 of Table 4.5. Friends could successfully guess more than half of the cities, and
more than 20% of relationships, places, first names, and facts in HU and TN categories.
Other categories had lower guess success, especially first and last names. This shows
another strength of LEPs over security questions. Bonneau et al. found that security questions
could not strike the right balance between security and memorability, because facts
that were memorable for users were also easily guessed [BBC+15, SBE09]. Conversely,
LEPs have multiple fact categories with high user recall and low friend guess rate (e.g.,
FN, FL, PL, OBJ).
Participant Feedback. I carefully observed participant behavior during guessing
and interviewed them about their strategy after the study. Except for two participants,
who attempted random guesses, the rest invested significant effort in looking for possible
guess options online. They reported using the following information sources for
guessing: personal knowledge, Facebook, Google search engine, Instagram, RenRen,
WeChat, QQ, Line, LinkedIn, and Spokeo. Overall, 78–83% of participants reported
using personal knowledge, 76–79% used social networks and 50–56% used search
engines. At social networks, participants checked the personal profile and friend lists
first, and then scanned the recent wall posts. They complemented this with online
searches for popular items which appeared in LEP questions.
I further collected participant feedback on why guessing LEPs was hard. About half
of responses (48%) stated that LEPs involved too much detail about the topic, which
was hard to mine from online sources or from personal knowledge. Unless the friend
was personally involved in the event, it was difficult to mine correct answers online.
In addition, 26% of participants acknowledged that many facts were too personal and
thus not shared among friends, nor used in social network posts, such as events from
early childhood and elementary school. These private and unique facts make guessing
difficult, unlike facts used for security questions, which can be easily found in online
sources.
When asked about what they liked and disliked about LEPs, 58.2% of participants
stated that LEPs were much harder to guess than the current security questions, and more
memorable than ordinary passwords since they were built from personal experience.
In addition, 11.4% of participants said that the variety of question sets and their detail
were another advantage of LEPs over security questions. 6.3% of participants reported
that it took time to come up with the life-event and fact choices, and 19% of participants
worried that they might forget the exact answers for LEPs.
More than 90% of participants reported that they would consider using and adapting
LEPs for different online accounts, while 6.3% said that they did not plan to use LEPs
due to time overhead and concerns about memorability. 44.3% were willing to use LEPs
for an online banking account, which needs high security, and has less frequent logins,
31.6% said they would use LEPs for secure and professional email accounts, and 17.7%
would use them for government accounts and health records.
4.6 Conclusions
Textual passwords are a widely-used form of authentication and suffer from many
deficiencies because users trade security for memorability. Users create weak passwords
because they are memorable, and reuse the same password across many sites. Forcing
users to create strong passwords does not help, as these are easily forgotten.
I have proposed life-experience passwords (LEPs) as a new authentication mechanism,
which strikes a good balance between security and memorability. I investigated
several LEP designs, and evaluated them in two user studies. My results show that LEPs
are much more memorable and secure than passwords, they are less often reused, and
they are strong against friend guessing. While they take more time to create and input
during authentication, I believe their benefits may make them a viable primary
authentication mechanism for high-security servers, or a much better secondary authentication
mechanism than security questions.
Chapter 5
Mnemonics Passphrase (MNPass)
5.1 Introduction
Textual passwords are widely used for today’s user authentication. Users are advised to
choose long character sequences and use characters from multiple classes (e.g., special
characters, letters, numbers) to make passwords hard to guess. Because users have many
online accounts today (an average of 25 in 2006 [FH07] and a median of 40 in 2016 by
my estimation), remembering passwords is challenging. Users reuse existing passwords,
and include predictable word and character patterns [RJK13a]. These practices decrease
security of passwords against automated guessing to a much lower value than expected
solely based on length and composition.
One way to make passwords more secure is to make them longer. Longer passwords
should be harder to guess by automated attacks, as the guessing space will be larger.
A passphrase is one example of a longer password, and is usually made by stringing
several words together. These words could be unrelated, e.g., “mother chicken apple,” or
form a sentence, e.g., “I love apple juice.” Passphrases also tend to be more memorable
than passwords, as they may contain expressions familiar to a user (e.g., verses of a
favorite song) and follow grammatical rules [KSS09, KSS07, SG94, ZH93]. Therein
lies the problem, though! The underlying grammatical structure of passphrases and
use of common phrases (e.g., verses from songs) lower their security well below the
security expected by length alone [RJK13a, KRC]. And if these patterns in passphrases
are broken by forcing users to use system-generated passphrases, security increases but
recall plummets [SKK+]. Thus, there appears to be a trade-off between passphrase recall
and security. It is hard to improve one without jeopardizing the other.
For this research, I explored the use of mnemonics – multi-letter abbreviations of
passphrases, to improve both recall and security. I form a mnemonic out of the first let-
ters of each word in a passphrase. For example, a passphrase “I love apple juice” would
be abbreviated to the mnemonic ILAJ. I first explore the use of mnemonics as authen-
tication hints to aid user recall. I further explore use of mnemonics during passphrase
creation to constrain user choices to only those passphrases which match the mnemon-
ics. I expect that this approach will reduce the presence of grammatical constructs or
common phrases in passphrases, and thus improve their security.
I evaluated mnemonics-aided passphrases in several user studies, which were
approved by an IRB, and compared those passphrases against user-chosen and system-
chosen passphrases. When displayed as user hints during authentication, mnemonics
improved recall by 30–36% after three days, and by 51–74% after seven days. When
users are asked to generate passphrases that match a given mnemonic, use of common
phrases is reduced from 50% to under 5%. I can combine these two uses of mnemonics
to arrive at passphrases that have good recall and high security (low use of common
phrases). While mnemonics, as authentication hints, lower security against statistical
guessing attacks, I can recoup this loss, while retaining high recall, by allowing the sys-
tem to generate one word of the passphrase, or by requiring longer passphrases. Further,
users find that mnemonics improve the usability of passphrases.
5.2 Related Work
There is much related work on passwords and alternative authentication methods. I
discuss here only those works that are closely related to my proposed use of mnemonics.
Rao et al. [RJK13a] discovered that long passwords have a distinct grammatical
structure. They analyzed part-of-speech (POS) tag sequences from the Brown Corpus
[FK79] and found that the grammatical structure decreases search space for passwords
by more than 50%. Veras et al. [VCT14a] explored semantic patterns of passwords and
showed how these patterns can be used to greatly improve attack success.
Bonneau and Shutova [BS12] studied short user-chosen passphrases (2+ words), and
showed that they are vulnerable to dictionary attacks, and that they have simple noun
structure. My work focuses on longer passphrases (5+ words), which I find to have a
more complex sentence structure.
Shay et al. [SKK+] found that both system-generated passphrases and system-generated
passwords are annoying to users and easy to forget. My studies confirm this
finding, but I focus on evaluation of mnemonic-aided passphrases.
Cued-recall systems (e.g., [CvOB07, CFBvO08, DMR04b]) have been proposed for
graphical passwords, as summarized by Biddle et al. [BCVO12]. In these systems a
user is shown an image or a set of images as a cue, and must recall which points on the
image she clicked, or which images she selected, in order to authenticate. Bicakci and
van Oorschot further propose gridWords [BvO11], textual, multi-word passwords that
can be entered by selecting them from a dropbox or by locating them on a grid, which
serves as a cue. My use of mnemonics as hints is also an example of cued-recall, but the
one applied to textual passphrase and using a textual cue.
Kuo et al. [KRC] studied mnemonic passwords, which are derived as abbreviations
of common phrases such as movie titles. Kuo et al. found 65% of mnemonic
passwords via Google searches. I show that my mnemonic passphrases do not suffer
from the same deficiency – fewer than 7.5% can be found using Bing searches.
User training has also been shown to improve password recall [BKCD14, DHS], and
it may use mnemonic techniques. My mnemonic structure differs from these works (I
use word abbreviations and not visual cues or narratives) and I use mnemonics to guide
creation or as authentication hints, rather than for user training.
5.3 Mnemonics and Passphrases
In this Section I define passphrases and mnemonics, and describe how I use mnemonics
to improve recall and security of passphrases.
5.3.1 Passphrases
A passphrase is a sequence of characters, usually much longer than a password, used
for authentication. Passphrases can contain any character, but usually all characters are
alphabetic. Passphrases can further contain capitalization, punctuation, numbers and
special characters. In this work, I focus on letter-only passphrases and I normalize them,
by removing capitalization and punctuation from user input. Thus, the only information
used for authentication is the alphabetical content of the passphrase. This allows us to
reason only about the content, which carries most of the meaning for the user, and how
this content affects recall and security.
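The normalization described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names `normalize_words` and `match_key` are hypothetical, and the sketch assumes that words are separated by whitespace and that every character other than the letters a to z (including digits) is discarded.

```python
import string


def normalize_words(passphrase: str) -> list[str]:
    """Lowercase the input and keep only the letters a-z of each word."""
    words = []
    for token in passphrase.split():
        letters = "".join(ch for ch in token.lower() if ch in string.ascii_lowercase)
        if letters:  # drop tokens that contained no letters at all
            words.append(letters)
    return words


def match_key(passphrase: str) -> str:
    """Letter-only content of the passphrase, with word boundaries removed."""
    return "".join(normalize_words(passphrase))


example = "Mom loves apples, and ORANGES!"
words = normalize_words(example)
key = match_key(example)
```

Keeping the word list separate from the joined key is a design choice of this sketch: the word list is what a mnemonic would be built from, while the joined key is what would be compared at authentication.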
A passphrase with letter-only content will consist of words. Most of these words
will come from the user’s natural language, and may be found in a dictionary or may be
proper names of people, objects and locations with significance to the user. The words
in a passphrase could be separated by spaces or they could be input together by the user
and segmented using a semantic classifier (e.g., [VCT14a]).
5.3.2 Mnemonics
Mnemonics improve recall of information by associating it with another representation,
such as an abbreviation, a rhyme, or a pattern. In my work, I use abbreviations of
passphrases as mnemonics, which I create out of the first letters of passphrase words.
5.3.3 Using Mnemonics
One could use mnemonics in two ways. First, they can be used as user hints, to
improve recall (aka recall cues [BCVO12]) – I will call these hint-mnemonics. A user
chooses a passphrase, and the system creates a hint-mnemonic and stores it with the
passphrase. For example, a user may choose “Mom loves apples and oranges” and the
resulting hint-mnemonic becomes “MLAAO”. At authentication, the system prompts
the user for her passphrase, and displays the hint-mnemonic. Figure 5.1(c) illustrates
regular authentication with no hints, and Figure 5.1(d) illustrates authentication with
hint-mnemonics.
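Deriving a hint-mnemonic from a chosen passphrase can be sketched in a few lines. The function name `hint_mnemonic` is hypothetical, and the sketch assumes whitespace-separated words; words typed without spaces would first need the segmentation step mentioned in Section 5.3.1.

```python
def hint_mnemonic(passphrase: str) -> str:
    """Concatenate the upper-cased first letter of every word, in order."""
    letters = []
    for token in passphrase.split():
        word = "".join(ch for ch in token if ch.isalpha())  # strip punctuation
        if word:
            letters.append(word[0].upper())
    return "".join(letters)


hint = hint_mnemonic("Mom loves apples and oranges")
```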
Use of mnemonics as hints will lower security. Passphrases, like passwords, are
stored hashed and salted, but mnemonics must be stored in clear since they are displayed
to users during authentication. Thus a statistical attacker (see Section 5.4) can tailor
his guessing to words starting with letters of the mnemonic, which greatly reduces the
search space. I evaluate this security cost in Section 5.6, and propose ways to recoup it.
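The storage asymmetry between the passphrase and its hint can be illustrated with a small enrollment record. This is a sketch under stated assumptions: the text only says that passphrases are stored hashed and salted, so the use of PBKDF2-HMAC-SHA256 with 100,000 iterations, the 16-byte salt, and the names `enroll` and `verify` are illustrative choices, not the study's design.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, for illustration only


def _letters_only(passphrase: str) -> str:
    """Normalized, letter-only content used for hashing."""
    return "".join(ch for ch in passphrase.lower() if "a" <= ch <= "z")


def enroll(passphrase: str) -> dict:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", _letters_only(passphrase).encode(), salt, ITERATIONS
    )
    # The hint is kept in clear because it must be displayed at login.
    hint = "".join(word[0].upper() for word in passphrase.split())
    return {"salt": salt, "hash": digest, "hint": hint}


def verify(record: dict, attempt: str) -> bool:
    digest = hashlib.pbkdf2_hmac(
        "sha256", _letters_only(attempt).encode(), record["salt"], ITERATIONS
    )
    return hmac.compare_digest(digest, record["hash"])


record = enroll("Mom loves apples and oranges")
```

Anyone who obtains such a record learns the `hint` field directly, which is exactly the information a statistical attacker would use to restrict guesses to words with matching initials.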
The second possible use of mnemonics is during passphrase creation – I will call
these guide-mnemonics. Because users tend to use common word sequences, popular
phrases, and grammatical rules in passphrases [RJK13a,BS12,KRC], many passphrases
can be guessed by mining these common patterns from public sources. Mnemonics can
be used during creation to improve randomness of word choices in passphrases, and to
reduce the reuse of passphrases across different accounts.
[Figure 5.1: Different passphrase creation and authentication methods using hint- and guide-mnemonics. Panels: (a) regular passphrase creation; (b) creation with guide-mnemonics, where the form displays the letters A, B, A, L, O and requires a passphrase whose words start with those letters; (c) regular passphrase authentication; (d) authentication with hint-mnemonics, where the form displays the hint “Your passphrase contains words starting with letters MLAAO”.]
Guide-mnemonics are generated by choosing letters from the alphabet according
to some algorithm. A user is then prompted to generate a passphrase matching this
mnemonic. Each passphrase word must start with one mnemonic letter, in order. For
example, a system may generate a guide-mnemonic “ABALO” and the user may input a
matching passphrase like “Apples bread and lox order”. Figure 5.1(a) illustrates regular
passphrase creation, and Figure 5.1(b) illustrates creation with guide-mnemonics. I fur-
ther allow extraneous passphrase words, which are not part of the mnemonic, because
they may aid recall. For example, a user may input “Apples bread and the lox are
ordered,” with “the” and “are” being extraneous words. If a guide-mnemonic is also to
be used as a hint-mnemonic, I adjust it before storing to reflect all the passphrase words
(e.g., ABALO becomes ABATLAO).
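The matching rule for guide-mnemonics, including extraneous words, can be sketched as a subsequence test. This is one possible reading of the rule, stated here as an assumption: the guide letters must appear, in order, among the initials of the passphrase words, and any other words are treated as extraneous. The function names are hypothetical.

```python
def initials(passphrase: str) -> str:
    """Upper-cased first letters of all whitespace-separated words."""
    return "".join(w[0].upper() for w in passphrase.split() if w[0].isalpha())


def matches_guide(passphrase: str, guide: str) -> bool:
    """True if the guide letters occur, in order, among the word initials."""
    remaining = iter(initials(passphrase))
    # `letter in remaining` advances the iterator, so order is enforced.
    return all(letter in remaining for letter in guide.upper())


def stored_hint(passphrase: str, guide: str) -> str:
    """Hint that reflects every word, including extraneous ones."""
    if not matches_guide(passphrase, guide):
        raise ValueError("passphrase does not match the guide-mnemonic")
    return initials(passphrase)


hint = stored_hint("Apples bread and the lox are ordered,", "ABALO")
```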
5.4 Attacker Models and Strength
In this Section I discuss my attacker models, and passphrase strength metrics.
5.4.1 Attacker Models
I consider the following attacker models, also used in past research [BBC+15, BHvS15].
A brute-force attacker tries all possible passphrases in random order. Without
mnemonics, the strength against brute-force attacks is $c^l$, where $c$ is the number of
characters in the passphrase alphabet and $l$ is the average passphrase length in characters.
Since I consider normalized, letter-only passphrases, $c$ would be 26. With hint-mnemonics,
a brute-force attacker would try passphrases containing all possible words
starting with the given letters in the mnemonic. A brute-force attacker model grossly
overestimates passphrase and password strength, as shown in [BS12, RJK13a]. I thus
do not use it for calculation of passphrase strength, but I use it to determine length of
guide-mnemonics (see Section 5.5.3).
A statistical attacker compiles lists of common sequences of words, and tries them
in order of popularity. I further distinguish between: (1) a language-model (LM)
attacker, which compiles probabilities for word sequences occurring together, and uses
these to guide his guessing, and (2) a phrase-dictionary attacker [KRC], which com-
piles lists of common phrases from online content, and tries them whole or in part.
5.4.2 Strength Against Attacks
Because brute-force attacks are suboptimal for the attacker, I do not evaluate strength
against them. For statistical attacks on passwords, prior works use a “heuristic measure
of password strength” [BHvS15], also known as the guesswork [Bon12a], which measures
the expected number of guesses until authentication success. Pliam [Pli] suggests
using the α-work-factor as a measure of password strength against an ideal attacker, who
knows the probability distribution of passwords in the specific corpus. This is unrealistic,
and also underestimates the strength of passwords in small corpora, like the one that I
have. I use the guess number measure of strength by Dell’Amico and Filippone [DF15],
which estimates the number of guesses until success using the sampling method over a
probabilistic password model. They have shown the accuracy of such estimated strength
against state-of-the-art attacks.
For a language-model attacker, I calculate the maximum probability of each
passphrase, among all possible passphrases of the same word-length from a 3.6B-word
corpus, using an n-gram model. I then convert this probability into a guess number using
Monte Carlo sampling as proposed in [DF15].
For a phrase-dictionary attacker, I cannot calculate guess number directly, since it
is hard to build a large-enough dictionary of common phrases. Instead, I measure the
longest overlap between each passphrase and my collection of common phrases. The
longer the overlap, the lower the strength against phrase-dictionary attackers, as there
are fewer words that the attacker must guess using other methods.
5.4.2.1 LM Attack Strength.
I regard a passphrase as a sequence of words in the English language. I then build a language
model (LM) [Jur00] to estimate the probability of this sequence of words in English. A
language model is a probability distribution over words and word sequences [Jur00].
The n-gram model estimates the probability of every n-word sequence in the corpus,
with the probability of single words being their frequency of occurrence in the corpus.
The n-gram model captures popular grammatical constructs (e.g., “I love”), and word
co-occurrence (e.g., “apple juice”, “iced latte”).
I built a unigram, bigram, and trigram language model from my corpus (higher-order
models are prohibitively expensive to build and have a high number of out-of-vocabulary
sequences). I then calculate the probability of each passphrase using these three models,
and convert the largest measure to guess number. I assume that an LM-attacker can build
similar models, and select her guesses in the decreasing order of probability.
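The scoring step, where a passphrase is evaluated under unigram, bigram, and trigram models and the largest probability is kept, can be sketched on a toy corpus. The sketch uses plain maximum-likelihood counts without smoothing, so unseen sequences get probability zero; the nine-word corpus and the function names are assumptions made only for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-word sequences in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sequence_probability(words, tokens, order):
    """Maximum-likelihood probability of `words` under an n-gram model.

    `order` is 1 (unigram), 2 (bigram) or 3 (trigram). The first words of
    the sequence back off to shorter contexts because no longer one exists.
    """
    counts = {n: ngram_counts(tokens, n) for n in range(1, order + 1)}
    prob = 1.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - order + 1):i])
        if not context:
            prob *= counts[1][(word,)] / len(tokens)
        else:
            seen_context = counts[len(context)][context]
            seen_both = counts[len(context) + 1][context + (word,)]
            prob *= seen_both / seen_context if seen_context else 0.0
    return prob


corpus = "i love apple juice and i love iced latte".split()
passphrase = ["i", "love", "apple", "juice"]
per_model = [sequence_probability(passphrase, corpus, n) for n in (1, 2, 3)]
best = max(per_model)  # the value that would be converted to a guess number
```

On this toy corpus the bigram and trigram models assign a much higher probability than the unigram model, because they capture the co-occurrences “i love” and “apple juice”.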
When building a language model, it is critical to have a large amount of training
data. Otherwise, there would be many out-of-vocabulary (OOV) words and sequences,
and my model would significantly underestimate probability of passphrases, and over-
estimate strength against statistical attacks. I build my language model using several
widely-used, large sources: (1) the UMBC WebBase corpus [HKF+13] – a collection of
English paragraphs, obtained from a February 2007 crawl, containing 100 million web
pages from more than 50,000 websites, and 3.6B words, (2) the “One Billion Word Language
Modeling Benchmark” [CMS+13] – a collection of English paragraphs, obtained
from WMT 2011 News Crawl data, (3) the texts of the Gutenberg Top 100 books [gut]
– a collection of famous books, and (4) the Brown Corpus [FK79] – containing 500 samples
of English-language text, totaling roughly one million words.
After obtaining a maximum probability of a passphrase from LM, I convert it into
guess number by using Monte Carlo Sampling on a corpus of 100,000 randomly-
generated passphrases, as described in [DF15].
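The conversion from probability to guess number can be sketched as follows. This follows the general form of the sampling estimator commonly attributed to [DF15], as understood here: draw n passphrases from the model, and for a target with probability p, sum 1/(n · p_i) over the sampled passphrases whose model probability p_i exceeds p. The function name and the toy probabilities below are assumptions; the source should be consulted for the exact procedure.

```python
def estimate_guess_number(sample_probs, target_prob):
    """Estimate how many passphrases the model ranks above the target.

    `sample_probs` holds the model probability of each passphrase in a
    sample drawn from the model itself; `target_prob` is the model
    probability of the passphrase being evaluated.
    """
    n = len(sample_probs)
    return sum(1.0 / (n * p) for p in sample_probs if p > target_prob)


# Toy model with three passphrases of probability 0.5, 0.3 and 0.2, and a
# sample of size 10 that happens to match those proportions exactly.
sample = [0.5] * 5 + [0.3] * 3 + [0.2] * 2
estimate = estimate_guess_number(sample, 0.2)
```

In the toy model two passphrases are more likely than the target, and the estimate recovers that count; with a real model the sample would be random and the estimate would only approximate the true rank.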
LM Adjustments for Mnemonics. When mnemonics are used as hints during
authentication, this changes my language models as some words and some sequences
become invalid. I adjust my language models for each passphrase by re-normalizing the
word distributions, so I do not overestimate the strength. Let $M$ be the hint-mnemonic
for one given passphrase $P_M$. I adjust my corpus for $P_M$ by keeping only those unigram,
bigram, and trigram sequences that contain words starting with letters in $M$ and
that follow the order of letters in $M$. I then use these sequences to build my language
models and calculate the probability of $P_M$.
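The mnemonic-specific adjustment can be sketched as a filter followed by re-normalization. The sketch adopts one plausible reading, labeled here as an assumption: an n-gram is kept when the initials of its words form a contiguous run of letters in the mnemonic. The function names and toy counts are hypothetical.

```python
def filter_ngrams(counts, mnemonic):
    """Keep n-grams whose word initials appear consecutively in the mnemonic."""
    letters = mnemonic.lower()
    kept = {}
    for gram, count in counts.items():
        gram_initials = "".join(word[0].lower() for word in gram)
        if gram_initials in letters:
            kept[gram] = count
    return kept


def renormalize(counts):
    """Turn the remaining counts into a probability distribution."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


bigram_counts = {
    ("mom", "loves"): 3,
    ("loves", "apples"): 1,
    ("apples", "mom"): 4,   # initials "am" do not occur in order in MLAAO
    ("iced", "latte"): 2,   # initials not in the mnemonic at all
}
kept = filter_ngrams(bigram_counts, "MLAAO")
adjusted = renormalize(kept)
```

Because the discarded n-grams no longer share the probability mass, each surviving n-gram becomes more likely, which is why skipping this step would overestimate strength.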
Out-of-Vocabulary Words. Any language model must cope with out-of-vocabulary
(OOV) n-grams. I apply the standard natural language processing approach to this problem,
by adding unknown words and n-grams to the corpus, and using smoothing to estimate
the probability of OOV n-grams [Jur00]. This essentially assigns the lowest, non-zero
probability to OOV n-grams out of all n-grams in the corpus.
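The effect of smoothing on unknown words can be illustrated with add-one (Laplace) smoothing on a unigram model. The text does not say which smoothing method was used, so add-one smoothing is a stand-in chosen for simplicity, and the function name and toy corpus are assumptions.

```python
from collections import Counter


def smoothed_unigram(tokens):
    """Return a function giving add-one smoothed unigram probabilities."""
    counts = Counter(tokens)
    # One extra slot in the denominator reserves mass for unknown words.
    denominator = len(tokens) + len(counts) + 1

    def probability(word):
        return (counts[word] + 1) / denominator

    return probability


corpus = "i love apple juice and i love iced latte".split()
prob = smoothed_unigram(corpus)
oov = prob("zebra")        # never seen, yet non-zero
in_vocab = prob("apple")   # seen once
```

Under this scheme an unseen word always receives a probability that is positive but lower than that of any word observed in the corpus, matching the behavior described above.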
5.4.2.2 Phrase-Dictionary Strength.
Prior research shows that users use common phrases, such as famous quotes, lyrics,
poems and movie titles, for generating mnemonic passwords [KRC]. The same trend
could occur for passphrases. In a phrase-dictionary attack, the attacker may use only a
portion of a common phrase, or use it in its entirety.
I calculate the overlap between a passphrase $P$ and a phrase dictionary as:
$$\mathrm{overlap}(P) = \max_i \, \mathrm{WO}(P, \mathrm{phrase}_i) \qquad (5.1)$$
where $\mathrm{phrase}_i$ is the $i$-th phrase in the dictionary, and WO is the function that
returns the length, in words, of the longest word-overlap between $P$ and $\mathrm{phrase}_i$.
For example, if $P$ = “Today early bird catches a fly and worm” and
$\mathrm{phrase}_i$ = “Early bird catches a worm”, then
$\mathrm{WO}(P, \mathrm{phrase}_i) = 4$. The overlap is calculated ignoring capitalization and punctuation,
i.e. the passphrase and the common phrases are both normalized. The overlap, however,
must occur in sequence and without gaps. While I cannot convert this measure into
guess number, the measure correlates inversely with passphrase strength against phrase-
dictionary attackers. Attackers can leverage a phrase-dictionary to speed up guessing
in the following manner. Assume an attacker knows or can guess the word-length of a
passphrase, e.g., five words. First, the attacker would create a phrase-dictionary and try
all five-word subsequences of each phrase. If these do not lead to success, she would try
all four-word subsequences of each phrase and she would try to guess the word in the
fifth or the first position using brute-force or language-model search. If none of these
result in authentication success, the attacker would proceed to three-word subsequences,
etc.
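The word-overlap measure can be sketched as a longest-common-contiguous-run computation over normalized words. The function names are hypothetical, and the dynamic-programming formulation is an implementation choice of this sketch rather than something stated in the text.

```python
import string

ALLOWED = set(string.ascii_lowercase + " ")


def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    cleaned = "".join(ch for ch in text.lower() if ch in ALLOWED)
    return cleaned.split()


def word_overlap(passphrase, phrase):
    """Length, in words, of the longest gap-free run shared by both inputs."""
    a, b = normalize(passphrase), normalize(phrase)
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word pair.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def overlap(passphrase, dictionary):
    """Longest word-overlap between the passphrase and any dictionary phrase."""
    return max((word_overlap(passphrase, p) for p in dictionary), default=0)


score = overlap(
    "Today early bird catches a fly and worm",
    ["Early bird catches a worm", "I love apple juice"],
)
```

The shared word “worm” does not extend the run “early bird catches a” because the two are separated by other words, which reflects the requirement that the overlap occur in sequence and without gaps.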
I built two dictionaries of common phrases. First, I collected famous quotes, lyrics,
poems, popular movie titles, and quotes from [Rob], [bqu], [poe], [lov], [Wei], [BS12], [imd]
into my “Famous phrases” dictionary. This dictionary contains 280,550 phrases. Second,
I used the Bing search engine to search for each passphrase in my study, selected the
top 10 matching pages, and inserted them into my “Popular pages” dictionary. I present
the results in a later section.
5.5 Passphrase Models
I now provide more details about the different passphrase creation and authentication models
that I consider in my work (summarized in Table 5.1). My baseline is the UPass
model – where users select a passphrase freely and receive no hints at authentication.
I then transform this model into UPassHint – where users freely select a passphrase,
but are shown a hint-mnemonic at authentication. These two models help me evaluate
how much hint-mnemonics aid recall. I next consider use of mnemonics both as
guide-mnemonics and as hint-mnemonics. This model allows me to evaluate the effect
of guide-mnemonics on passphrase strength; they do not affect recall. I further consider
cases where a system generates one or all words in the passphrase, to increase the security
of the passphrase. My generic model MNPass(s) uses the parameter s to denote
the number of words generated by the system (zero or one). Finally, I explore how
passphrases entirely generated by the system affect security and recall, and how much
hint-mnemonics can help the recall of such passphrases. These are my models SysPass
and SysPassHint.

                 Words created by
Model            User    Sys   Example                                              Auth. hint
User-chosen passphrases
UPass            all     0     My Cat Is Very Funny                                 no
UPassHint        all     0     Apple Banana Orange Grape Pear (ABOGP)               yes
Mnemonics-guided passphrase choices
MNPass(0)        all     0     Important Uganda Greg Arbitrary Bountiful (IUGAB)    yes
MNPass(0)-Long   all     0     She Uses Lemon Polish High Up Right (SULPHUR)        yes
MNPass(1)        all-1   1     European Union Strange Postings Online (EUSPO)       yes
System-chosen passphrases
SysPass          0       all   Omnipresent Texture Monaco Narcotic Disney           no
SysPassHint      0       all   Precocious Base Graze Blazoned Specialty (PBGBS)     yes
Table 5.1: Passphrase models considered in my work.
5.5.1 Baseline: User-Chosen Passphrases
Today, users freely select a passphrase, and the system only enforces a given length
policy, or requires the presence of specific character classes. At authentication, users are
prompted for a passphrase, and are not shown any hints. This is my baseline UPass
model, which I seek to improve with regard to recall and security.
5.5.2 Improve Recall: Authentication Hints
I allow users to freely select all the words in a passphrase. The system segments the
input passphrase into words, and abbreviates each word to its first letter. The letters are
then concatenated in order to form a hint-mnemonic. The hint-mnemonic is shown to
the user at authentication time. I refer to this passphrase model as UPassHint.
5.5.3 Improve Security: Mnemonic-Guided Passphrase Creation
I first generate a guide-mnemonic, by choosing letters from an alphabet, following some
algorithm. I then ask a user to choose passphrase words in order of the mnemonic
letters, and ensure that each word starts with the given letter. The system then stores the
mnemonic as hint-mnemonic in clear and associates it with the user’s username, just like
a password salt is stored today. At authentication time, the hint-mnemonic is displayed
to aid recall. I call this passphrase model MNPass.
Mnemonics Generation. My goal is to generate mnemonics that result in
passphrases exceeding some target strength (TS) against a brute-force attacker.
The length and composition of letters in mnemonics will directly influence recall,
diversity and strength of passphrases. The longer the mnemonic, the longer and more
secure the passphrase. However, overly long passphrases may lead to reduced recall. Further,
each mnemonic letter guides a user to choose a word starting with this letter. Thus
the guess space of a passphrase is the product of the guess spaces for each word. To
correctly estimate a guess space for each word I must carefully choose a dictionary. Too
small a dictionary will underestimate the guess space [SKK+], and too large a dictionary
will overestimate it.
I use the Google 20K [goo] data set, which includes the 20,000 most common English
words in order of frequency, calculated over Google’s Trillion Word Corpus. I
believe this data set represents the most popular English words on the Internet
sufficiently well. I further pre-filter this dictionary to exclude words starting with
q, x, y, and z, which have too few word choices. This leaves a 22-letter alphabet for
mnemonic generation.
I generate mnemonics by randomly selecting letters, one by one, from my 22-letter
alphabet, and estimating the strength (guess number) of the resulting passphrase. For a
mnemonic containing $k$ letters the estimated strength $ES$ is calculated as:
$$ES = \prod_{i=1}^{k} \mathrm{words}(\mathit{dict}, mn(i)) \qquad (5.2)$$
where $k$ is the number of letters in the mnemonic, $mn(i)$ returns the letter at position $i$
in the mnemonic, and $\mathrm{words}(\mathit{dict}, mn(i))$ returns the number of words from dictionary
$\mathit{dict}$ that start with the letter $mn(i)$. When $ES$ exceeds the target strength $TS$, I stop
and output the mnemonic.
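The generation loop can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, letters are drawn uniformly (the text only says “randomly”), and the demonstration uses a made-up count of 900 words per letter instead of the real per-letter counts of the filtered dictionary.

```python
import random
import string

# 22-letter alphabet: q, x, y and z are excluded.
ALPHABET = [c for c in string.ascii_lowercase if c not in "qxyz"]


def words_per_letter(dictionary):
    """Number of dictionary words starting with each alphabet letter."""
    counts = {letter: 0 for letter in ALPHABET}
    for word in dictionary:
        if word and word[0] in counts:
            counts[word[0]] += 1
    return counts


def generate_mnemonic(counts, target_strength, rng=random):
    """Add random letters until the estimated strength exceeds the target."""
    candidates = [l for l in ALPHABET if counts[l] > 1]
    if not candidates:
        raise ValueError("dictionary too small to ever reach the target")
    letters, estimated = [], 1
    while estimated <= target_strength:
        letter = rng.choice(candidates)
        letters.append(letter)
        estimated *= counts[letter]  # running product of per-letter word counts
    return "".join(letters).upper(), estimated


toy_counts = {letter: 900 for letter in ALPHABET}  # hypothetical counts
mnemonic, strength = generate_mnemonic(toy_counts, 95 ** 8, random.Random(0))
```

With the hypothetical 900 words per letter, five letters fall short of $95^8$ and six exceed it, so the sketch always stops at six letters; with real, uneven per-letter counts the length would vary with the letters drawn.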
System-Generated Words. Mnemonic-guided passphrase creation may still lead to
low passphrase strength if users build their passphrase out of very popular words. I thus
explore the passphrase model where the system chooses one word in a passphrase, and the
user chooses the rest. To accommodate this model, I parameterize MNPass by s, where s
is the number of system-selected words. I explore MNPass(0) and MNPass(1). System-
selected words may have low recall, as they lack personal significance for a user. My
evaluation shows that this is not the case for MNPass(1), i.e., letting the system choose
only one word does not greatly lower recall.
Longer Passphrases. I explore another approach to increase passphrase security,
by requiring 20% longer passphrases in approach MNPass(0)-Long.
5.5.4 Improve Security: System-Chosen Passphrases
One can also improve passphrase security by letting the system generate the entire
passphrase. I explore this approach without authentication hints – SysPass model – and
with hint-mnemonics – SysPassHint model.
5.6 Evaluation
Model            Participants   Avg. words/pass.   Avg. chars/pass.   Avg. chars/word
                                All     User       All     User       All     User
User-chosen passphrases
UPass            44             5.3                24.5               4.6
UPassHint        56             5.4                25.6               4.8
Mnemonics-guided passphrase choices
MNPass(0)        66             5.2                26.2               5.0
MNPass(0)-Long   51             6.9                36                 5.2
MNPass(1)        62             5.0     4.0        30.6    23.3       6.1     5.8
System-chosen passphrases
SysPass          58             5.1     0          38.3    0          7.4     0
SysPassHint      56             5.2     0          39.0    0          7.6     0
Total            393
Table 5.2: Basic statistics on passphrases per passphrase model
I used Amazon Mechanical Turk to evaluate the recall and strength of the different
passphrase models described in the previous Section. All my user studies were reviewed
and approved by my Institutional Review Board (IRB). I found that hint-mnemonics
improve recall by 30–36% after three days, and by 51–74% after seven days (Section
5.6.2). I further found that guide-mnemonics reduce the presence of common phrases
in passphrases from 50% to under 5% (Section 5.6.3). Finally, users reported that
mnemonics are easy to use and helpful (Section 5.6.4).
5.6.1 User Studies
Amazon Mechanical Turk participants were assigned at random to one passphrase model
from the previous Section. I recruited participants with at least 1,000 completed Human
Intelligence Tasks (HITs) and >95% HIT acceptance rate. I asked each participant to
create one passphrase in one sitting. The participant was then asked to return after three
days and after one week to authenticate. I paid 35 cents for passphrase creation and 40
cents for each of the two authentication tasks.
Limitations and Ecological Validity. My study had the following limitations, many
of which are common for online password studies. First, it is possible but very unlikely
that a participant enrolled in my study more than once. While the same Mechanical
Turk user could not enter the study twice (as identified by her MTurkID), it is possible
for someone to create multiple Mechanical Turk accounts. There is currently no way to
identify such participants. Second, I cannot be sure that my participants did not write
down or photograph their passphrases. I did not ask the participants in the post-survey
whether they had done this, because I believed that participants who cheated would also
be unlikely to admit it. I designed my study to disincentivize cheating. I promised to
pay participants in full regardless of authentication success. My study mechanisms further
detected copy/paste actions and I excluded any participant that used these (for
whatever reason) from the study. I also reminded the participants multiple times to rely
on their memory only. If any cheating occurred it was likely to affect all the results
equally. Thus my data can still be used to study improvement of recall and security
between passphrase models. Third, while I asked Mechanical Turk workers to pretend that
they were creating passphrases for real servers, they may not have been very motivated
or focused. This makes it likely that actual recall of real-world passphrases would be
higher across all models. While it would have been preferable to conduct my studies
in the lab, the cost of matching the large participation I had through Mechanical Turk
would have been too high for me to afford.
Passphrase model parameterization. I wanted to make passphrase models comparable
to each other in character-length. This allowed me to investigate factors like
recall and security of passphrases, independent of length. The length of MNPass models
is controlled by the target strength (TS) parameter. I use $TS = 95^8$, which is the
theoretical maximum strength against brute-force attacks for 3class8 passwords. These
passwords are generated according to the frequently-used password policy: 8-character
length and use of at least three out of the following four character classes: lowercase
and uppercase letters, digits and special characters. While the exact mnemonic length
depends on the randomly chosen letters of the mnemonic, most of my mnemonics were
5–6 characters in length for $TS = 95^8$, and thus led to 5–6 word long passphrases. To
make the other models comparable, I instructed participants to enter passphrases that
contain at least five words. Also, for models where all words are chosen by the system,
I mimicked the mnemonic-guided creation with $TS = 95^8$. I generated the mnemonic,
and then drew all the words for the passphrase at random from my dictionary, to match
the mnemonic. The MNPass(0)-Long model explores longer passphrases. I used target
strength $TS = 95^{10}$ for this model, which mimics 3class10 passwords and which led to
passphrases with 6–7 words.
Passphrase Creation. I provided a short tutorial and examples for each passphrase
model and then asked participants to create one passphrase. All users were asked not
to write down or copy their answers, and to rely on memory only. I automatically
checked if passphrase constraints (e.g., word length, mnemonic match) were met during
passphrase creation, and on failure, I asked the user to recreate the passphrase.
Passphrase Authentication. Each user was asked to make two authentication visits,
one after three days, and one after one week since passphrase creation. I allowed at most
five trials to authenticate per passphrase and per visit. All users were asked not to paste
their answers. I had automated detection of copy or paste attempts in my forms, and I
rejected the users who were detected to perform either of these two actions. I further
displayed a notice to participants, at both the creation and authentication screens, that
they will get paid regardless of their authentication success. This ensured that partici-
pants had no monetary incentive to cheat. After the second authentication visit, I asked
participants to complete a usability survey.
Participant and Passphrase Statistics. In total, there were 1,273 participants who
created a passphrase. Out of 1,273 participants, 731 (57.47%) returned
for the first authentication and 426 (58.3%) returned for the second authentication.
A total of 393 participants completed both the first and the second authentication, and
I only include their data in my analysis. Some basic statistics for the passphrases per
passphrase model are shown in Table 5.2. I had 44–66 participants per passphrase model.
Except for MNPass(0)-Long, all models generated passphrases of comparable
word-length. User-chosen words tended to be shorter (4.21–5.8 characters) than system-chosen
words (7.4–7.6 characters), which directly maps into differences in character length
between user-chosen passphrases and those that had some words chosen by the system.
Similar to prior research [BS12, RJK13a], I evaluated the semantic structure
of passphrases by applying part-of-speech (POS) tagging and dependency parsing. I
grouped the passphrases into three categories: (1) noun sequence (contains nouns only),
(2) sentence (contains subject, verb and object) and (3) segment (all others). For space
reasons, I only summarize these results. About half of user-chosen passphrases follow
the sentence structure, and about a third are sequences of nouns. Conversely, Bonneau
and Shutova found in [BS12] that most short passphrases were segments. The difference
between my findings and theirs is likely due to my use of longer passphrases (5+
words instead of 2+). Mnemonic-guided passphrase creation disturbs the user tendency
towards the sentence structure, and leads to a more balanced distribution of semantic
structures (towards noun sequences and segments), which increases the attacker’s guessing
effort. Longer (6–7 word) passphrases tend to favor sentence structure as much as user-chosen
passphrases, but they have a higher percentage of segments and a lower percentage
of noun sequences. System-chosen passphrases all have the segment structure.

                   Authentication success
                   exact match        relaxed match
Passphrase Model   3 day    7 day     3 day    7 day
w/o hint
UPass              52.3%    40.0%     63.6%    45.0%
SysPass            20.7%    12.5%     22.4%    14.3%
w hint
UPassHint          71.4%    69.6%     76.8%    73.2%
SysPassHint        26.8%    18.9%     28.7%    19.6%
MNPass(0)          69.7%    66.7%     80.3%    66.7%
MNPass(1)          69.3%    67.7%     75.8%    69.3%
MNPass(0)-Long     66.7%    62.8%     78.4%    72.5%
Table 5.3: User recall three days and seven days after passphrase creation
5.6.2 Mnemonics Improve Recall
I considered recall successful if users matched the entire passphrase in its normalized
form, that is, with capitalization, punctuation and whitespace removed. I call this match
criterion exact match. I also considered a relaxed match criterion, where I normalize
nouns to their singular form, and verbs to their stem form using the Porter stemming
algorithm [por], before both storage and authentication. I hypothesized, and my results
confirm, that this relaxed matching further improves recall, and does not greatly decrease
strength against statistical attacks. Table 5.3 shows the authentication success per model,
for the exact and the relaxed matching.
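The two match criteria can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names are hypothetical, and `crude_stem` is a deliberately simplified stand-in for the Porter stemmer, handling only a few common suffixes.

```python
import re
import string


def normalize_exact(passphrase: str) -> str:
    """Exact-match form: lowercase, with punctuation and whitespace removed."""
    lowered = passphrase.lower()
    return "".join(
        ch for ch in lowered
        if ch not in string.punctuation and not ch.isspace()
    )


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for the Porter stemmer (assumption, not the real algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip a suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_relaxed(passphrase: str, stem=crude_stem) -> str:
    """Relaxed-match form: stem every word, then join without separators."""
    words = re.findall(r"[a-z0-9]+", passphrase.lower())
    return "".join(stem(w) for w in words)


def matches(stored: str, attempt: str, relaxed: bool = False) -> bool:
    """Compare two passphrases under the exact or the relaxed criterion."""
    norm = normalize_relaxed if relaxed else normalize_exact
    return norm(stored) == norm(attempt)
```

With this sketch, "I worked on eggs" and "i work on egg" differ under exact matching but agree under relaxed matching. A real deployment would plug an actual Porter stemmer in through the `stem` parameter.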
Hint-Mnemonics Improve Recall. Within the same passphrase model (UPass
vs. UPassHint, SysPass vs. SysPassHint) there was a statistically significant increase
(Welch Two Sample t-test, 95% confidence interval) in recall rates when hint-mnemonics
were used. Recall rates of UPassHint were higher than those of UPass
(t(88)=1.76, p=0.04), and recall rates of SysPassHint were higher than those of SysPass
(t(85)=2.75, p=0.003). In relative terms, hint-mnemonics improve recall by 30–36% after
three days. The role of hint-mnemonics becomes more prominent as time goes on. After seven
days, hint-mnemonics improve recall by 51–74%. Improvement is also greater for
user-chosen passphrases than for system-chosen ones. User-chosen passphrases have
personal significance to the user, and hints help users recall this association and thus recall
the passphrase. Over four days (day 3 vs day 7 recall), recall of user-chosen passphrases
declines by 24%, but when hints are used, recall declines only by 3%. System-chosen
passphrases lack personal significance. Over four days, their recall declines by 40%
without hints, and by 30% with hints. Overall, the recall rate of system-chosen passphrases
is around 2.5–3.2 times lower than that of user-chosen passphrases.
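The percentages in this paragraph are relative changes between entries of Table 5.3. The short sketch below recomputes them from the exact-match columns; the dictionary layout and the function name are illustrative choices, and the results agree with the quoted ranges up to rounding.

```python
# Exact-match recall (%) from Table 5.3, as (3 day, 7 day) pairs.
RECALL = {
    "UPass": (52.3, 40.0),
    "SysPass": (20.7, 12.5),
    "UPassHint": (71.4, 69.6),
    "SysPassHint": (26.8, 18.9),
}


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


for base in ("UPass", "SysPass"):
    hinted = base + "Hint"
    for idx, day in enumerate((3, 7)):
        gain = relative_change(RECALL[base][idx], RECALL[hinted][idx])
        print(f"{base} -> {hinted}, day {day}: {gain:+.1f}% recall")

for model, (day3, day7) in RECALL.items():
    decline = relative_change(day3, day7)
    print(f"{model}: {decline:+.1f}% from day 3 to day 7")
```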
Hints Aid Recall of Important Facts. Users could fail to recall passphrases because
they forgot the words they chose during creation, or because they forgot details about
these words, e.g., the exact form of the verb they used (e.g., “work” vs “working” vs
“worked”) or if they used plural or singular form of a noun (e.g., “egg” vs “eggs”). To
investigate these aspects I compare recall for exact match versus relaxed match criteria.
Relaxed matching improves recall by 3–21% over exact matching. The improvement is
larger when no hints are used (8–21%) than with the hints (3–17%). Yet, these improve-
ments are much lower than improvements from using hints (e.g., 12% versus 72% for
UPass), signifying that hints aid recall of important facts.
MNPass Recall Comparable to UPass Recall. Mnemonic-guided passphrase creation
may jeopardize personal significance of passphrases to the user, and thus impair
recall. I investigate this possibility by comparing recall of my three MNPass models
with UPassHint. MNPass(0) is structurally the most similar to UPassHint: both models
contain around five user-chosen words per passphrase. Recall of MNPass(0) is a little
lower than that of UPassHint (69.7% vs 71.4% after three days, 66.7% vs 69.6% after
seven days). When one of the words is chosen by the system, as in MNPass(1), the recall
stays roughly the same, and comparable to that of UPassHint. Thus system-based
generation of one passphrase word has a minor impact on recall. But the generation of all
words by the system (as in SysPass and SysPassHint) drastically lowers recall. When
I increase the length of the MNPass from five to seven words, the recall declines
further. It is 66.7% for MNPass(0)-Long vs 71.4% for UPassHint after three days, and
62.8% vs 69.6% after seven days.
5.6.3 Mnemonics Improve Security Against Phrase-Dictionary
Attacks
I next investigate the impact of guide-mnemonics on passphrase strength. Table 5.4
shows the median strength (guess number) against the LM attacker per passphrase model,
in columns 2–5. I use the exact and the relaxed matching for attacker guesses. I show the
strength against the full LM attacker (when no hint-mnemonics are shown to users) in
columns 2–3, and the strength against the adjusted LM attacker (when hint-mnemonics
are shown to users) in columns 4–5. Table 5.5 shows the strength against the
phrase-dictionary attacker as the percentage of passphrases that have 0, 1, 2 and 3+ words
overlapping with some phrase in the dictionary. The longer the overlap, the lower the
strength against the phrase-dictionary attacker. I show the overlap for my “Famous phrases”
dictionary (columns 2–5 of Table 5.5) and for my “Popular pages” dictionary (columns 6–9).
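The overlap measure can be sketched as follows. The sketch assumes that "overlap" means the longest run of consecutive words shared by a passphrase and a dictionary phrase; that reading, the function names and the sample phrase list are assumptions made for illustration.

```python
def longest_overlap(passphrase_words, phrase_words):
    """Length of the longest run of consecutive words common to both sequences."""
    best = 0
    prev = [0] * (len(phrase_words) + 1)
    for a in passphrase_words:
        cur = [0] * (len(phrase_words) + 1)
        for j, b in enumerate(phrase_words, start=1):
            if a == b:
                # Extend the run that ended at the previous word of both sequences.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def overlap_bucket(passphrase, dictionary):
    """Bucket a passphrase as '0', '1', '2' or '3+' by its longest dictionary overlap."""
    words = passphrase.lower().split()
    longest = max(
        (longest_overlap(words, phrase.lower().split()) for phrase in dictionary),
        default=0,
    )
    return "3+" if longest >= 3 else str(longest)


# Hypothetical one-entry dictionary, for illustration only.
famous = ["roses are red violets are blue"]
print(overlap_bucket("my roses are red today", famous))
print(overlap_bucket("purple roses sing", famous))
```

Counting how many passphrases of a model fall into each bucket and dividing by the model's total gives percentages of the kind reported per dictionary.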
Mnemonics Impact on Strength Against LM Attacks. When mnemonics are
used only during passphrase creation, but not during authentication (columns 2–3 in
Table 5.4), the strength against the LM attacker does not change significantly. A Welch
Two Sample t-test shows no difference in means (t(99)=0.07, p=0.95). System-chosen
Passphrase model         LM attack (guess number)
                         w/o hint                                        w hint
                         exact                   relaxed                 exact                   relaxed
User-chosen passphrases
  UPass / UPassHint      $8 \times 10^{15}$      $4.2 \times 10^{15}$    $1.2 \times 10^{10}$    $8.8 \times 10^{9}$
Mnemonics-guided passphrases
  MNPass(0)              $4.0 \times 10^{17}$    $1.3 \times 10^{17}$    $1.1 \times 10^{12}$    $4.0 \times 10^{11}$
  MNPass(1)              $2.6 \times 10^{19}$    $1.0 \times 10^{19}$    $1.3 \times 10^{16}$    $5.4 \times 10^{15}$
  MNPass(0)-Long         $2.6 \times 10^{21}$    $7.1 \times 10^{20}$    $1.5 \times 10^{15}$    $1.3 \times 10^{15}$
System-chosen passphrases
  SysPass / SysPassHint  $2.6 \times 10^{21}$    $7.9 \times 10^{20}$    $3.9 \times 10^{17}$    $3.9 \times 10^{17}$
Table 5.4: Passphrase strength against LM attacker per passphrase model. For LM
attacker, I show the estimated strength guess number.
Passphrase model         Phrase-dictionary attack (%)
                         Famous phrases             Popular pages
                         0     1     2     3+       0     1     2     3+
User-chosen passphrases
  UPass / UPassHint      0     8     40    52       0     20    29    51
Mnemonics-guided passphrases
  MNPass(0)              0     34    61    5        0     29    63    8
  MNPass(1)              8     50    39    3        8     44    48    0
  MNPass(0)-Long         6     17    59    18       6     16    60    18
System-chosen passphrases
  SysPass / SysPassHint  0     100   0     0        0     97    3     0
Table 5.5: Passphrase strength against phrase-dictionary attacker per passphrase model.
For the phrase-dictionary attacker, I show the percentage of passphrases that have a
given number of words overlapping with a common phrase.
passphrases show a statistically significant increase in strength (t(74)=-3.07, p=0.002)
from $8 \times 10^{15}$ to $2.6 \times 10^{21}$. However, when mnemonics are used both
during creation and authentication, the passphrase strength against the LM attacker drops
five orders of magnitude, from $4 \times 10^{17}$ to $1.1 \times 10^{12}$. Thus
hint-mnemonics have a large impact on strength against the LM attacker.
I can recoup much of this strength loss if I allow the system to suggest one word in
a passphrase (MNPass(1)) or if I ask for longer passphrases (MNPass(0)-Long). Both
of these approaches return the passphrase strength against the LM attacker to a guess
number above $10^{15}$.
Mnemonics Impact on Strength Against Phrase-Dictionary Attacks. I report
the percentage of passphrases that have a given-length word overlap with common
phrases in my dictionaries in Table 5.5. I discuss only the overlap with famous
phrases (columns 2–5); the other overlap shows a similar trend. Having a 1-word overlap
is common and expected, as passphrases use popular English words. 52% of
user-chosen passphrases have a 3+ word overlap with some popular phrase, and 40% have a
2-word overlap. In addition, 5% of UPass and 13% of UPassHint passphrases fully
match popular quotes or search results in my phrase dictionaries. Examples include
“where there is will there is way”, “Seven Days Of The Week”, “roses are
red violets are blue”, and “eines schickt sich nicht fur alle”. Therefore, phrase-dictionary
attacks can be highly effective.
When mnemonics are used during passphrase creation, 3+ overlap reduces to only
5% for MNPass(0) and 2-word overlap increases to 61%. When the system is allowed
to choose one word in a passphrase, the 3+ word overlap reduces to only 3% for
MNPass(1) and 2-word overlap reduces to 39%. Finally, when users are asked to gen-
erate longer passphrases, 18% of them have 3+ word overlap with popular phrases, and
59% have 2-word overlap. I conclude that MNPass(0) and MNPass(1) models signifi-
cantly increase strength of passphrases against phrase-dictionary attacks.
System-Aided Better Than Longer. I can improve strength of passphrases by
making guide-mnemonics longer or by allowing the system to choose one word of
the passphrase. Comparing the strength of the resulting passphrases, I find that both
approaches have a similar effect. Both increase the passphrase guess number for the
LM attacker from $1.1 \times 10^{12}$ to $1.3 \times 10^{16}$ or $1.5 \times 10^{15}$,
i.e. by a factor of 1,000–10,000. However, system-aided word selection has a much lower
negative impact on recall (Section 5.6.2) and leads to passphrases that do not
significantly overlap the common phrases (3+ word overlap is under 5%).
Mnemonic-Guided Comparable to System-Chosen. Another way to improve
strength of passphrases is to let the system generate all the passphrase words, but that
leads to low recall (Section 5.6.2). Comparing the strength of SysPass with the strength
of MNPass(1) and MNPass(0)-Long, I find that passphrases fully generated by the system
are only around 100 times stronger than mnemonic-guided passphrases.
Relaxed Matching Is Acceptable. Relaxed matching lowers security, because more
of the attacker’s guesses lead to successful authentication. I measure this effect by
applying the exact and the relaxed matching to my language model for strength
calculation. Relaxed matching lowers the strength by at most a factor of 10, and thus has
an acceptable security cost, while greatly improving recall.
5.6.4 Users Like Mnemonics
After each participant completed their 7-day authentication, they were asked to rate
their agreement with the following statements, on a Likert scale from 1 (strongly
disagree) to 10 (strongly agree): (1) Mnemonic hints were helpful for authentication, and
(2) Authentication approach was easy to use. For space reasons, I summarize these
results. Participants strongly agreed that hints were helpful for recall of MNPass (7.76)
and UPass (6.86), but found them much less helpful for SysPass (4.46). Regarding ease
of use, participants rated mnemonic-guided passphrases (6.98–8.25) as highly as
user-chosen passphrases (6.57–7.86), and they rated system-chosen passphrases much lower
(2.51–3.15).
5.7 Conclusions
It is difficult to create a passphrase mechanism which has both a high recall and a high
strength against statistical attacks. I explored the use of mnemonics to this effect. I found
that use of mnemonics as authentication hints significantly improves recall because it
helps users remember which words they chose during passphrase creation. Mnemonics
can also be used to guide passphrase creation, which reduces use of common phrases
and improves strength against statistical attacks. While the use of mnemonics at authen-
tication lowers security, I can recoup this loss by allowing the system to choose one word
in a passphrase. This passphrase model keeps both the security and recall of passphrases
high. The security matches that of system-chosen passphrases, while the recall matches
that of user-chosen passphrases. Mnemonics at passphrase creation will further prove
to be a valuable aid in reducing passphrase reuse. I believe mnemonics are a promising
technique in our quest to improve user authentication.
Chapter 6
Semantically-Guided Password
Generation (GuidedPass)
6.1 Introduction
Left to their own devices, users create passwords that may be weak but that are
memorable. Systems can improve this practice in two ways. First, systems can suggest or
enforce specific password composition policies, which lead to stronger passwords. But
it has been shown that password composition policies are not consistent across different
sites [JYW+15, ESM+13, KSK+11, UKK+12], which indicates a lack of clear understanding
by system administrators of the role that password composition plays in determining
password strength. Further, stringent password composition requirements increase
users’ frustration and lead them to write down their passwords [IS10], which is a bad
practice. Another way to improve a user’s password choice is to offer real-time feedback
on this choice and, optionally, suggestions for improvements. NIST recently proposed
a new password composition policy [GFN+17], which enforces a minimum of 8 characters
and requires systems to reject passwords that appear on the list of previously-leaked
passwords or common dictionary words. Password meters also offer real-time feedback
on user password strength [JYW+15], although this feedback may be inconsistent
[JYW+15, ESM+13, KSK+11, UKK+12]. But the problem lies in the fact that both
NIST and meter approaches inform users which of their passwords are bad, without
providing clear guidance on how to evolve them into better passwords.
In this work, I propose a new approach, called GuidedPass, which offers real-time,
actionable and constructive feedback to users on how to improve their password choices to
achieve higher strength. I start from an insight that users make password choices starting
from facts and semantic structures that are memorable to them, i.e., from facts that have
some personal significance. I then tailor my suggestions in such a way to preserve, as
much as possible, those facts and structures, while improving password strength.
The latest research by Ur et al. [UAA+17], CMUPass, is a data-driven password meter
that provides proactive textual password suggestions to users. It provides more specific
suggestions to users than GuidedPass, and does not display semantic patterns to users.
The derivation of its rules is not described in detail. Also, its evaluation is performed
with a specific password composition policy, so it is difficult to understand the
effectiveness of the suggestions. Their approach focuses on improving strength like other
meters, and efforts to preserve and improve memorability have not been shown.
My suggestions are devised based on analysis of more than 3,200 passwords that
were successfully recalled by participants in my prior user studies. I specifically
focus on identifying structural and content-based differences between medium-strength
and high-strength passwords, so I can devise suggestions that would evolve low- or
medium-strength passwords into high-strength ones. I evaluated my approach and several
competing approaches via a user study, where each participant was asked to create
a password using one approach and return to authenticate within two days. GuidedPass
improves the strength of initial user input $10^{7}$ times, while preserving high
memorability. Passwords created with GuidedPass are also 100 times stronger and 20% more
memorable than passwords created with only password-meter feedback. Also, GuidedPass
passwords are $10^{5}$ times stronger and 14% more memorable than those created under
the new NIST draft. Finally, although both CMUPass and GuidedPass offer specific guidance
to users on how to improve their passwords, GuidedPass outperforms CMUPass in
memorability (80% versus 70%) because I provide suggestions that are more semantically
meaningful to users while CMUPass focuses mainly on improving strength.
6.2 Background and Related Work
I now provide more details about specific related work that aims to increase user pass-
word strength at password creation time.
Password composition policies are regularly used to steer users toward stronger pass-
words. A commonly used password policy is the 3class8 policy, which requires a pass-
word to be at least 8 characters long, and to include at least three out of four charac-
ter classes: digits, uppercase and lowercase letters and special characters. However,
there are many inconsistencies among password policies [SB04, FH10], and a lack of
clear understanding which policy is the best. Further, even when users meet the policy
requirements, their passwords can still be weak because they are created from common
words and patterns. For example, Password123 satisfies the 3class8 requirement but
is among the top 33,523 of the 14,344,391 passwords in the leaked RockYou dataset,
where it occurs 62 times.
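The 3class8 rule described above is simple to express in code. The following is a minimal sketch of such a check; the function name is hypothetical, and treating every non-alphanumeric character (including whitespace) as a special character is an assumption of this sketch.

```python
def satisfies_3class8(password: str) -> bool:
    """Check the 3class8 policy: at least 8 characters and 3 of 4 character classes."""
    classes = [
        any(c.isdigit() for c in password),      # digits
        any(c.isupper() for c in password),      # uppercase letters
        any(c.islower() for c in password),      # lowercase letters
        any(not c.isalnum() for c in password),  # special characters (assumed: any non-alphanumeric)
    ]
    return len(password) >= 8 and sum(classes) >= 3


print(satisfies_3class8("Password123"))  # policy-compliant, yet common in leaked data
print(satisfies_3class8("password"))     # only one character class
```

The first call shows the point made in the text: compliance with the policy says nothing about how common the password is.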
In [KKM+12], Kelly et al. found that passwords created under a minimum 8 characters
policy are significantly weaker than passwords created under stricter policies. Shay et
al. [SKD+14] compared eight different password composition policies and found that
a long password with fewer constraints can be more usable and stronger than a short
password with more constraints, which may also lead to unsafe practices like writing
down passwords [SKK+10].
Telepathwords [KSC+14] provides proactive suggestions to users, highlighting weak
or predictable password patterns. The system learns character distribution based on its
existing password data, and uses it to detect character patterns that are likely to follow
what the user has typed. These patterns are flagged as bad and the user is thus steered
towards less likely patterns. Telepathwords performance data shows what we would
expect: password strength increases due to system intervention by a median of 3.7 bits
in the zxcvbn [Whe16] entropy measure, but recall suffers, at around 62%, because users
are steered from words that are meaningful to them towards those with lower personal
significance.
NIST [GFN+17] has recently proposed a new textual password composition policy.
The new proposal removes requirements for different character classes but leaves the
length requirement. The system is also required to flag any previously leaked password
or a password that contains common dictionary words as weak. While this feedback
informs the user what parts of their password may be susceptible to a guessing attack,
it does not provide clear guidance on how to build a better password. Such guidance is
needed, because in its absence users only make small changes to their initial
passwords [HCM+17].
Another way to steer users toward strong passwords is to use a password meter.
Several researchers [ESM+13, SKD+16, UKK+12] found that password meters help users
create stronger passwords and others [KSK+11, SKK+10] found that a combination of
password strength meter and password composition policy leads to even stronger
passwords. However, there is a significant inconsistency in scoring a password across
different meters [dCM14].
In [UAA+17], Ur et al. present a system similar to mine, which provides real-time,
specific guidance to users on how to improve their passwords. Their approach
recommends that users create passwords of at least 12 characters. It also detects a range
of patterns that should be avoided, such as dictionary words and common passwords.
However, they provide fewer direct suggestions than GuidedPass on how to remove
those patterns from users’ current input passwords. For example, upon detecting
a predictable sequence or dictionary word, their approach does not provide direct
recommended actions to users. Another point of difference between my work and Ur et
al. [UAA+17] is that they develop their suggestions for secure passwords by learning
about secure patterns from leaked password datasets. But there is no way to verify if
these passwords were all memorable to their users. I use a dataset of proven-memorable
passwords and thus I believe my suggestions interfere less with memorability. In a
later section, I provide a side-by-side comparison of suggestions generated by
GuidedPass and the meter by Ur et al. [UAA+17] with different examples.
6.3 Password Dataset
In this section, we first describe our password dataset, which we used to analyze good
password patterns. We then provide more details about the patterns we have identified.
6.3.1 Dataset Statistics
We ran several IRB-approved user studies over the period of 3 years, to test LEPs and
compare them with ordinary, 3class8 passwords. In these studies we collected more
than 3,200 passwords, which were successfully recalled by participants 2 days or later
after creation. We thus consider these passwords memorable and seek to evaluate their
security and identify patterns that make them secure.
We used the Monte Carlo method in [DF15] to measure each password’s strength.
The method was trained on a total of 21 million leaked passwords from RockYou,
LinkedIn, MySpace, and eHarmony. Based on whether the password’s strength was
below an online attacker’s capability ($10^{6}$ guesses [FHVO16b]), between online and
offline attacker capabilities ($10^{6}$ and $10^{14}$ guesses [FHVO16b]), or above an
offline attacker’s capability, we classified it into the weak, medium or strong category,
respectively. Table 6.1 shows the number and percentage of passwords in these categories.
strength class                                     count    %
weak ($\text{guess} < 10^{6}$)                     109      3.34
medium ($10^{6} \le \text{guess} < 10^{14}$)       2,276    69.82
strong ($\text{guess} \ge 10^{14}$)                875      26.84
total                                              3,260    100
Table 6.1: Memorable password dataset, categorized into three different strength groups
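The classification into strength groups reduces to two threshold comparisons on the guess number. The sketch below assumes the boundaries are inclusive on the lower side (medium starts at $10^{6}$, strong at $10^{14}$); the constant and function names are illustrative. It also recomputes the percentages in Table 6.1 from the counts.

```python
ONLINE_LIMIT = 10**6    # online attacker capability, in guesses
OFFLINE_LIMIT = 10**14  # offline attacker capability, in guesses


def strength_class(guess_number: float) -> str:
    """Map a password's estimated guess number to weak, medium or strong."""
    if guess_number < ONLINE_LIMIT:
        return "weak"
    if guess_number < OFFLINE_LIMIT:
        return "medium"
    return "strong"


# Counts from Table 6.1; the percentages follow from the total.
COUNTS = {"weak": 109, "medium": 2276, "strong": 875}
TOTAL = sum(COUNTS.values())
for name, count in COUNTS.items():
    print(f"{name}: {count} ({100 * count / TOTAL:.2f}%)")
```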
6.3.2 Password Composition
We now analyze password features and composition.
Length: Table 6.2 shows the average, standard deviation and median length of
passwords in our three strength groups, and Figures 6.1 and 6.2 show the empirical
cumulative and probability distribution of password length. The length was very
different between the strong, medium, and weak groups (KW test, $p = 9.87 \times 10^{-151}$),
while the difference is smaller but still significant between weak and medium groups
(Holm-Bonferroni-corrected Mann-Whitney U (HC-MWU) test, $p = 2.31 \times 10^{-5}$). The
statistical difference between the medium and strong groups is significant (HC-MWU test,
$p = 3.11 \times 10^{-142}$). Hence length seems to play a critical role for password strength.
Strength   3class8      Length
type       perc. (%)    avg.     median   std.
weak       6%           8.83     9        1.74
med        58.10%       9.88     9        1.98
strong     74.20%       13.73    13       4.8
Table 6.2: Percentage of 3class8 passwords and password length statistics
Figure 6.1: Empirical CDF of password length
Figure 6.2: Empirical PDF of password length
Number of 3class8 passwords: We show the percentage of passwords that meet the
3class8 requirement in Table 6.2. More than half of medium-strength passwords satisfy
3class8, and about a quarter of strong passwords do not. Hence, the 3class8 policy does
not directly lead to strong passwords.
Number of symbols, digits, and uppercase letters: We show the statistics for the
number of symbols, digits and uppercase letters in Table 6.3. All strength groups had
similar statistics for the number of symbols and there was no statistical difference between
them. However, there was a statistical difference with regard to the number of digits
present in weak, medium and strong passwords (KW test, $p = 2.66 \times 10^{-43}$), with
stronger passwords having more digits. The HC-MWU test between the strong and
medium groups yields $p = 3.68 \times 10^{-18}$, and between the medium and weak groups
$p = 5.96 \times 10^{-24}$. Similarly, stronger passwords also had a higher incidence of
uppercase letters (KW test, $p = 1.96 \times 10^{-55}$). The HC-MWU test between the strong
and medium groups yields $p = 1.18 \times 10^{-48}$, and between the medium and weak
groups $p = 3.73 \times 10^{-4}$.
Strength   Symbols            Digits             Uppercase          Class ch.
type       avg (std)   med    avg (std)   med    avg (std)   med    avg (std)   med
weak       2.3 (1.8)   2      0.1 (0.4)   0      0 (0.1)     0      1.0 (0.9)   1
med.       2.6 (1.6)   2      0.7 (0.7)   1      0.2 (0.5)   0      1.8 (0.9)   2
strong     2.6 (1.9)   2      1.1 (1.0)   1      0.6 (0.9)   0      2.8 (2.1)   2
Table 6.3: Average, median and std. of number of symbols, digits and uppercase letters
and number of class changes for passwords in each strength type
Number of class changes: We define a class change as having two consecutive
substrings in a password from different character classes. Statistics for this measure
are shown in columns 8 and 9 of Table 6.3. As password strength increased, so did
the number of class changes (KW test, $p = 1.87 \times 10^{-73}$). The HC-MWU test
between the strong and medium groups yields $p = 4.58 \times 10^{-47}$, and between the
medium and weak groups $p = 1.62 \times 10^{-27}$.
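To make the class-change measure concrete, the sketch below counts the positions at which two adjacent characters belong to different classes. It assumes the four classes used by 3class8 (digits, uppercase, lowercase, symbols); whether the original analysis separated uppercase from lowercase is not stated, so that choice and the function names are assumptions.

```python
def char_class(ch: str) -> str:
    """Assign a character to one of four classes (assumed to mirror 3class8)."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    return "symbol"


def class_changes(password: str) -> int:
    """Count boundaries between consecutive substrings of different classes."""
    classes = [char_class(c) for c in password]
    return sum(1 for a, b in zip(classes, classes[1:]) if a != b)


print(class_changes("Password123"))  # upper -> lower -> digit
print(class_changes("pa55-Word"))
```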
6.3.3 Semantic Patterns
In this section we analyze semantic patterns of strong, medium, and weak passwords.
We use Veras’s semantic parser [VCT14b] to segment each password and label segments
with their part-of-speech (POS) tags. We also use zxcvbn’s [Whe16] heuristics to detect
dictionary words, personal names and leaked passwords within each password.
We show the percentage of unique semantic patterns in each strength category in
Table 6.4. 91% of strong passwords have a unique pattern, while 47.7–51.3% of weak
and medium passwords have a unique pattern. Thus high strength may be correlated with
uniqueness of the semantic pattern.
Uniq. semantic pattern (%)
weak 47.7%
med 51.3%
strong 91.1%
Table 6.4: Percentage of unique semantic pattern
Weak type          %       Med. type                     %
(np1)(number4)     1.94    (np1)(number4)                3.6
(np1)(number2)     1.49    (np1)(special1)(number4)      2.15
(nn1)(number2)     0.8     (np1)(number3)(special1)      1.85
(nn1)(number1)     0.57    (np1)(number4)(special1)      1.63
(nn1)(number3)     0.46    (np1)(number2)(special1)      1.54
(nn1)(number4)     0.46    (nn1)(number1)(special1)      1.27
(number8)          0.46    (nn1)(number3)(special1)      1.27
(np1)              0.34    (np1)(number1)(special1)      1.19
(np1)(number1)     0.34    (nn1)(number4)                0.97
(np1)(number3)     0.34    (nn1)(special1)(number4)      0.97
Table 6.5: Top 10 semantic patterns of weak and med. strength group
Number of semantic tags and its sequence: Next, we investigate the number of
semantic segments in a password. Table 6.7 shows the average, median, and standard
deviation of the number of segments in each password group, and we show the empirical
CDF and PDF in Fig. 6.3 and 6.4, respectively. Weak passwords have only 2 segments
in median, while medium-strength passwords have 3 segments and strong passwords
have 5 segments. A user can create more segments by adding more words to a password
or by breaking up password words, interleaving them with symbols and digits.
Strong type %
(char1)(nn1)(char2)(number4)(special1) 1.03
(char1)(ppis1)(number1)(special1)(number4) (char1)(ppis1) 1.03
(special1)(at)(nn1)(jj)(number4)(special1) 0.91
(char1)(nn2)(char2)(number2)(special1) 0.8
(char4)(number2)(special2) 0.57
(char6)(number3)(special1) 0.46
(nn)(number2)(special1) 0.46
(np1)(np1)(number2)(special1) 0.34
(number4)(char1)(special1)(nn1)(special1)(char2) 0.34
(special1)(nn)(nn1)(number2) 0.34
Table 6.6: Top 10 semantic patterns of strong strength group
Num. of semantic tags per password    avg.    median    std
weak                                  2.21    2         1.05
med                                   3.44    3         1.09
strong                                5.22    5         2.2
Table 6.7: Number of the semantic tags in each password type
Figure 6.3: Empirical CDF of the number of semantic tags
Figure 6.4: Empirical PDF of the number of semantic tags
6.3.4 How to make passwords stronger
Here, we offer our conclusions on what makes a password stronger, based on our data
analysis.
Uncommon or non-dictionary words. Even with the same pattern, for example
(np1)(number4), a password may have medium or high strength. Thus the commonness
of the noun used in the password (np1) will determine the password strength. After
analyzing some strong passwords, we found that it is not as difficult as one might think
to create uncommon words. For example, strong passwords often had a dictionary word
broken into segments by interleaving it with digits or symbols or by misspelling.
The longer, the better. There is a strong correlation between password strength and
length using the Pearson Product-Moment correlation ($r = 0.533$, $p = 2.2681 \times 10^{-238}$),
and also between the strength and the number of semantic segments in the password
($r = 0.543$, $p = 2.2681 \times 10^{-249}$).
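For reference, the Pearson product-moment coefficient used here can be computed as in the sketch below. The data points are made up for illustration (they are not drawn from the dataset), and using the base-10 logarithm of the guess number as the strength value is an assumption of this sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical (length, log10 guess number) pairs, for illustration only.
lengths = [8, 9, 10, 12, 14, 16]
log_guesses = [6.5, 9.0, 8.2, 12.1, 13.0, 17.4]
print(f"r = {pearson_r(lengths, log_guesses):.3f}")
```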
Non-common structure. We observed that after creating a simple phrase, users with
strong passwords evolved its structure into a non-common structure by adding random
digits or symbols in an unexpected way. For example, I Love you vs. IL o vey ou yielded
different strength by adding symbols in a non-predictable position.
Foreign language words. We observed that passwords that include parts of
words from a foreign language often fell into the strong category. One can argue that
attackers can develop statistical guessing tools using different languages, but such tools
are not yet popular.
Long, memorable digit sequences. 25 strong passwords had digit sequences longer
than 8 digits that did not resemble a date or phone number format.
6.4 Designing Suggestions
I assume that users initially choose passwords based on some facts that have personal
significance to them, which makes them memorable. The goal of my suggestions to
a user is to evolve the user’s existing password into a stronger version without losing
memorability. I further strive to provide high-level, general, “fuzzy” suggestions
without being too specific. I do this for two reasons. First, I want to allow sufficient space
for users to interpret these suggestions in a way that does not lower memorability. Second,
if my suggestions were too specific, attackers could use this to create tailored guessing
attacks, which would be much more efficient than my estimates of the resulting
password strength. I propose the following password suggestion model:
$$\mathrm{Password}_{new} = f_N(\mathrm{Password}_{current}, M_{new}), \qquad (6.1)$$
where $\mathrm{Password}_{current}$ is the user’s current password, $M_{new}$ is some new
text added to the password, and $f_N$ is a function that the user performs to integrate
$M_{new}$ with $\mathrm{Password}_{current}$. We focus on types of functions that an average
person could trivially perform. Blum et al. [BV15] proposed a humanly-computable password
model, including some computation operations that a user would perform to generate
a password. In my project, I narrow down this list to the following operations: addition,
insertion, replacement with or without deletion, swapping, breaking/perturbing a
sequence and redistributing/separating/moving. I do not suggest deletion since it reduces
the length of passwords. Next, I consider types of new information, $M_{new}$, the user can
enter. As I characterized in the previous section, for strong passwords $M_{new}$ should be
chosen from uncommon words, symbols, or digits. Finally, my suggestions are given
in a fuzzy manner (e.g., suggesting breaking of patterns detected in a password, but not
exactly where/how to break them), i.e., they are as general as possible to preserve the
large space of possible passwords.
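A few of the user-performed operations behind Eq. (6.1) can be written down as simple string functions. This is only an illustrative sketch: in GuidedPass the user, not the system, decides where and how to apply an operation, so the function names, positions and example values below are all hypothetical.

```python
def add(password: str, m_new: str) -> str:
    """Addition: append the new text to the current password."""
    return password + m_new


def insert(password: str, m_new: str, position: int) -> str:
    """Insertion: place the new text at a position chosen by the user."""
    return password[:position] + m_new + password[position:]


def replace(password: str, old: str, m_new: str) -> str:
    """Replacement: substitute the first occurrence of `old` with the new text."""
    return password.replace(old, m_new, 1)


def swap(password: str, i: int, j: int) -> str:
    """Swapping: exchange the characters at two positions."""
    chars = list(password)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


current = "defense"
print(insert(current, "7#", 3))   # breaks the dictionary word in the middle
print(add(current, "6122017"))    # appends a digit sequence
```

Each function plays the role of the integration function for one choice of operation, taking the current password and the new text as inputs.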
6.4.1 Suggestions
To be able to suggest the right changes to a password, we first need to detect semantic
content and patterns of a user-entered password in real time. Using our POS segmenta-
tion and zxcvbn [Whe16] tool we detect dictionary words, names, common sequences,
and blacklisted passwords. Upon detecting problematic content or patterns, we high-
light them and generate targeted suggestions. Table 6.8 shows our suggestions for each
problematic pattern/content with several examples. For comparison, Table 6.8 also
shows the suggestions generated by the approach of Ur et al. [UAA+17].
At each step we present all applicable suggestions to users, instead of just one or
a few, so that users can choose which ones to adopt without sacrificing memorability.
While this makes for a longer reading for users we believe, and our results confirm this,
that it guides them to the strong version of their initial password in a more straightfor-
ward manner.
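The step of presenting all applicable suggestions can be sketched as a lookup from detected categories to suggestion texts. The suggestion strings below are the GuidedPass entries of Table 6.8; the category keys and the function name are hypothetical, and the detection step itself (POS segmentation and zxcvbn) is assumed to have already produced the list of categories.

```python
# GuidedPass suggestions per detected category (texts taken from Table 6.8).
SUGGESTIONS = {
    "popular_name": [
        "Add an uncommon name",
        "Add a few numbers or symbols in the middle of the name",
    ],
    "leaked_password": [
        "Add an uncommon word",
        "Add a few numbers or symbols in the middle of a word",
    ],
    "sequence": ["Perturb the sequence or separate into a few segments"],
    "date": ["Perturb the sequence or separate into a few segments"],
    "simple_structure": [
        "Add one of the following: uncommon word, uncommon name, or mix of symbols",
    ],
}


def suggestions_for(detected_categories):
    """Return every applicable suggestion once, in order of first appearance."""
    result = []
    for category in detected_categories:
        for text in SUGGESTIONS.get(category, []):
            if text not in result:
                result.append(text)
    return result


for line in suggestions_for(["leaked_password", "sequence", "date"]):
    print("-", line)
```

Deduplicating keeps the list short when several detected patterns share the same advice, while still showing the user every distinct option.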
User input: John
  Category: Top 1K popular names
  GuidedPass: 1. Add an uncommon name
              2. Add a few numbers or symbols in the middle of the name
  CMUPass by Ur et al. [UAA+17]: 1. Contain 8+ characters
              2. Not be an extremely common password

User input: Password123
  Category: Leaked top 50K passwords
  GuidedPass: 1. Add an uncommon word
              2. Add a few numbers or symbols in the middle of a word
  CMUPass by Ur et al. [UAA+17]: 1. Not be an extremely common password

User input: 12345
  Category: Sequence
  GuidedPass: 1. Perturb the sequence or separate into a few segments
  CMUPass by Ur et al. [UAA+17]: 1. Contain 8+ characters
              2. Not be an extremely common password

User input: aabbccaabbcc
  Category: Repeating pattern
  GuidedPass: 1. Add an uncommon word
              2. Move a few numbers or symbols to the middle of the pattern to break repeating pattern
  CMUPass by Ur et al. [UAA+17]: 1. Don’t use words used on Wikipedia (ccaa)
              2. Avoid repeating sections (aabbcc)
              3. Have more variety than repeating the same 3 characters (a, b and c)

User input: defense
  Category: Popular dictionary word
  GuidedPass: 1. Add an uncommon word
              2. Add a few numbers or symbols in the middle of a word
  CMUPass by Ur et al. [UAA+17]: 1. Don’t use dictionary words (defense)

User input: 6122017
  Category: Date
  GuidedPass: 1. Perturb the sequence or separate into a few segments
  CMUPass by Ur et al. [UAA+17]: 1. Avoid using dates like 6122017

User input: defense6122017
  Category: Simple structure
  GuidedPass: 1. Add one of the following: uncommon word, uncommon name, or mix of symbols
  CMUPass by Ur et al. [UAA+17]: 1. Consider inserting digits into the middle, not just at the end

Table 6.8: Comparisons of generated suggestions
Both approaches do well at detecting problematic or weak patterns and generate
suggestions based on them, as shown in Table 6.8. However, GuidedPass provides
more direct actions for users to perform, such as “Add an uncommon name” or “Add a
few numbers or symbols in the middle of the name,” to avoid detected patterns, while
CMUPass provides less actionable feedback, indicating only “Not be an extremely
common password” for the same input passwords. Upon detecting a predictable sequence
such as “12345,” GuidedPass guides users on how to break the sequence, whereas
CMUPass is less informative about which actions users should take. These differences
in suggestions apply to dictionary words, leaked passwords, and dates, as shown in
Table 6.8. CMUPass, in turn, provides more informative and helpful suggestions for
repeating sequences such as “aabbccaabbcc.” Both GuidedPass and CMUPass guide users
to create a password with diverse character classes. In our case, however, the classes
include not only character classes but also semantic classes such as uncommon words
or uncommon names. Hence, GuidedPass adds more diversity to its suggestions. On the
other hand, CMUPass enforces a minimum password length, while the current GuidedPass
does not.
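To illustrate how detected categories can be mapped to targeted suggestions, the following
minimal Python sketch encodes the GuidedPass column of Table 6.8 as a lookup table. The
detectors are deliberately simplistic stand-ins and are assumptions made for illustration
only: the word lists (POPULAR_NAMES, LEAKED_PASSWORDS, DICTIONARY_WORDS), the regular
expressions, and the function names are hypothetical, and the actual system uses richer
detection, including the zxcvbn tool [Whe16].

```python
import re

# GuidedPass suggestions per detected category, copied from Table 6.8.
SUGGESTIONS = {
    "popular name": [
        "Add an uncommon name",
        "Add a few numbers or symbols in the middle of the name",
    ],
    "leaked password": [
        "Add an uncommon word",
        "Add a few numbers or symbols in the middle of a word",
    ],
    "dictionary word": [
        "Add an uncommon word",
        "Add a few numbers or symbols in the middle of a word",
    ],
    "sequence": ["Perturb the sequence or separate into a few segments"],
    "repeating pattern": [
        "Add an uncommon word",
        "Move a few numbers or symbols to the middle of the pattern "
        "to break repeating pattern",
    ],
    "date": ["Perturb the sequence or separate into a few segments"],
    "simple structure": [
        "Add one of the following: uncommon word, uncommon name, or mix of symbols",
    ],
}

# Hypothetical toy lists standing in for real name, leak, and dictionary corpora.
POPULAR_NAMES = {"john"}
LEAKED_PASSWORDS = {"password123"}
DICTIONARY_WORDS = {"defense"}


def is_sequence(text):
    """True for runs such as '12345' (each character one code point above the last)."""
    return len(text) >= 3 and all(ord(b) - ord(a) == 1 for a, b in zip(text, text[1:]))


def detect_categories(password):
    """Return every Table 6.8 category flagged by the toy detectors."""
    lower = password.lower()
    checks = [
        ("popular name", lower in POPULAR_NAMES),
        ("leaked password", lower in LEAKED_PASSWORDS),
        ("dictionary word", lower in DICTIONARY_WORDS),
        ("sequence", is_sequence(password)),
        ("repeating pattern", re.fullmatch(r"(.{2,})\1+", password) is not None),
        ("date", re.fullmatch(r"\d{1,2}\d{1,2}(19|20)\d{2}", password) is not None),
        ("simple structure", re.fullmatch(r"[A-Za-z]+\d+", password) is not None),
    ]
    return [name for name, hit in checks if hit]


def suggest(password):
    """Collect all applicable suggestions, without duplicates, in detection order."""
    seen, result = set(), []
    for category in detect_categories(password):
        for text in SUGGESTIONS[category]:
            if text not in seen:
                seen.add(text)
                result.append(text)
    return result
```

Returning every applicable suggestion, rather than only the first match, mirrors the design
choice described above of presenting all applicable suggestions to the user.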
6.5 Experiment
I now describe the user studies I conducted to evaluate the benefits of GuidedPass over
competing approaches. All user studies were reviewed and approved by my Institutional
Review Board (IRB). I recruited participants from among Amazon Mechanical Turk workers.
6.5.1 Approaches
We now provide more details about the password creation approaches we evaluate. They
are summarized in Table 6.9.
Approach | Description
GuidedPass | Our approach with detailed textual suggestions with zxcvbn meter enforcement
GuidedPass-NE | GuidedPass with no zxcvbn meter strength enforcement but only showing visual bar
CMU-NE | Ur et al.’s [UAA+17] textual suggestions with no zxcvbn meter strength enforcement but only showing visual bar
zxcvbn | zxcvbn meter [Whe16] with meter strength enforcement
zxcvbn-NE | zxcvbn meter [Whe16] with no meter enforcement but only visual bar
NewNIST | New NIST Proposal (800-63) [Gui06] (minimum 8 characters and blacklist password enforcement)
Table 6.9: Password creation approaches
We evaluate two flavors of our guided password creation: GuidedPass and
GuidedPass-NE. In both approaches we provide two types of feedback to users about
their passwords: (1) detailed textual analysis of their password and specific suggestions
for improvement and (2) a visual password strength meter from zxcvbn [Whe16]. The
difference between the approaches is that GuidedPass-NE does not enforce the meter, i.e.,
a user may choose to accept a lower password strength and finish the creation process,
while GuidedPass forces each user to continue password creation until the password
strength, as measured by the meter, exceeds zxcvbn score 5. We use the zxcvbn meter’s
visual bar in both approaches to show users their strength progress.
We compare our work with the suggestions of Ur et al. [UAA+17], and denote this
approach as CMU-NE. In [UAA+17], the authors enforced different password composition
policies, such as a minimum length requirement and a requirement for specific character
classes, along with their suggestions. In this work, no password composition policy is
enforced, in order to better understand the effectiveness of the suggestions. Also, their
original meter has a different strength scale and implementation. To make a fair
comparison with the other approaches described in this work, we pair Ur et al.’s
suggestions with the zxcvbn strength meter, as in GuidedPass-NE and zxcvbn-NE. However,
we do not enforce meter strength and only display the visual bar to show progress.
We further evaluate the value of textual guidance on how to improve passwords, compared
with no guidance. For this we evaluate password meter feedback alone, with and without
enforcement. These are approaches zxcvbn and zxcvbn-NE, respectively. We also evaluate
NIST’s proposed guidance for password creation, called NewNIST, where users are required
to create a password of at least 8 characters. Users are further alerted when their
password is on a blacklist and forced to change it, but are not offered any guidance
about how to change it.
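The six approaches differ along a few switches, which can be sketched as a small
configuration table. The following Python sketch is an illustration under stated
assumptions, not the study's implementation: the Approach class, its field names, and the
may_finish helper are hypothetical, and the target meter score is left as a parameter
rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    textual_suggestions: bool  # detailed textual suggestions shown to the user
    enforce_meter: bool        # user must exceed a target meter score to finish
    min_length: int = 0        # minimum password length (0 means not enforced)
    enforce_blacklist: bool = False


# Settings as described in Table 6.9 and the surrounding text.
APPROACHES = [
    Approach("GuidedPass", textual_suggestions=True, enforce_meter=True),
    Approach("GuidedPass-NE", textual_suggestions=True, enforce_meter=False),
    Approach("CMU-NE", textual_suggestions=True, enforce_meter=False),
    Approach("zxcvbn", textual_suggestions=False, enforce_meter=True),
    Approach("zxcvbn-NE", textual_suggestions=False, enforce_meter=False),
    Approach("NewNIST", textual_suggestions=False, enforce_meter=False,
             min_length=8, enforce_blacklist=True),
]


def may_finish(approach, password, meter_score, target_score, blacklist=frozenset()):
    """Decide whether the user may stop editing and submit the password."""
    if len(password) < approach.min_length:
        return False
    if approach.enforce_blacklist and password.lower() in blacklist:
        return False
    if approach.enforce_meter and meter_score <= target_score:
        return False  # enforcement requires the score to exceed the target
    return True
```

Under this sketch the only difference between GuidedPass and GuidedPass-NE, and between
zxcvbn and zxcvbn-NE, is the enforce_meter flag, which is what allows the study to isolate
the effect of enforcement from the effect of guidance.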
6.5.2 User Study Design
In the user study, each participant was assigned at random to one approach for password
creation. I recruited participants with at least 1,000 completed Human Intelligence Tasks
(HITs) and a >95% HIT acceptance rate. I asked each participant to create one password
for an imaginary server. After two days each participant was invited to return to the
study and attempt to authenticate with her password. I paid 35 cents for password cre-
ation and 40 cents for the authentication task, respectively. Participants were paid for
authentication regardless of whether they recalled their password correctly or not.
Authentication. Each user was asked to authenticate two days after password creation.
I allowed at most five attempts to authenticate per password and per visit. All users
were asked not to paste their answers. My login forms automatically detected copy and
paste attempts, and I excluded users who performed either action. I further displayed
a notice to participants, at both the creation and authentication screens, that they
would be paid regardless of their authentication success. This ensured that participants
had no monetary incentive to cheat. At the end of the authentication visit, I asked
participants to complete a short survey to assess their sentiment about the usability
of each password creation approach.
6.5.3 Statistical Tests
In order to analyze statistical significance across the different approaches, I performed
an omnibus chi-squared (χ²) test on categorical data at the p = 0.05 significance level.
For successful recall, without assuming a specific distribution of the data, we performed
pairwise Fisher’s Exact Tests (FET), which yield more accurate confidence for relatively
small sample sizes. For multiple-testing correction, we used the Holm-Bonferroni (HC)
method.
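The pairwise procedure can be sketched in a few lines of standard-library Python. This is
a minimal illustration under stated assumptions rather than the analysis code used in the
study: the function names are hypothetical, the two-sided p-value uses the common
convention of summing all tables no more likely than the observed one, the omnibus test
is not shown, and the counts are the successful/total recall numbers reported in
Table 6.12.

```python
from itertools import combinations
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def holm_bonferroni(pvalues):
    """Return Holm-Bonferroni adjusted p-values, in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), keeping monotonicity.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# (successful recalls, returning participants) per approach, from Table 6.12.
RECALL = {
    "GuidedPass": (109, 150),
    "GuidedPass-NE": (120, 148),
    "CMU-NE": (85, 119),
    "zxcvbn": (79, 142),
    "zxcvbn-NE": (90, 127),
    "NewNIST": (109, 162),
}


def pairwise_recall_tests(recall):
    """Run FET on every pair of approaches and attach Holm-adjusted p-values."""
    pairs = list(combinations(recall, 2))
    raw = []
    for first, second in pairs:
        s1, n1 = recall[first]
        s2, n2 = recall[second]
        raw.append(fisher_exact_two_sided(s1, n1 - s1, s2, n2 - s2))
    return dict(zip(pairs, zip(raw, holm_bonferroni(raw))))


if __name__ == "__main__":
    for (first, second), (p, q) in pairwise_recall_tests(RECALL).items():
        print(f"{first} vs {second}: raw p = {p:.4f}, Holm-adjusted p = {q:.4f}")
```

With six approaches there are 15 pairwise comparisons, which is why a multiple-testing
correction is needed before interpreting any single pairwise p-value.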
6.5.4 Limitations and Ecological Validity
Our study had the following limitations, many of which are common for online password
studies. First, it is possible but very unlikely that a participant may enroll into our study
more than once. While the same Mechanical Turk user could not enter the study twice
(as identified by her Mechanical Turk ID), it is possible for someone to create multiple
Mechanical Turk accounts. There is currently no way to identify such participants.
Second, we cannot be sure that our participants did not write down or photograph
their passwords. We did not ask the participants in the post-study survey whether they
had done this, because we believed that participants who cheated would be unlikely to
admit it. We designed our study to disincentivize cheating. We promised to pay
participants in full regardless of authentication success. Our study mechanisms further
detected copy/paste actions, and we excluded any participant who used these (for
whatever reason) from the study. We also reminded the participants multiple times to
rely on their memory only. If any cheating occurred, it was likely to affect all the
results uniformly. Thus our data can still be used to study improvement of recall and
security between password creation approaches.
Model | Created | Auth. after 2 days
GuidedPass | 218 | 150
GuidedPass-NE | 207 | 148
CMU-NE | 180 | 119
zxcvbn | 204 | 142
zxcvbn-NE | 203 | 127
NewNIST | 219 | 162
Total | 1,231 | 848 (69.2%)
Table 6.10: Total number of participants who created and authenticated
Third, while we asked Mechanical Turkers to pretend that they were creating pass-
words for a real server, they may not have been very motivated or focused. This makes
it likely that actual recall of real-world passwords would be higher across all creation
approaches. While it would have been preferable to conduct our studies in the lab, the
cost of reaching as many participants as we did through Mechanical Turk would have been
too high for us to afford.
6.6 Results
In this section, we present the results of our user study. We present the demographic
information, the password strength and recall, and the number of steps it took for a
user to converge to a suitable password. We also provide the user sentiment on each
approach.
6.6.1 Participant Statistics
In total, there were 1,231 participants that created passwords. Two days after creation,
we sent an email to all of them to return for authentication. Out of 1,231 participants,
848 participants returned (return rate 69.2%), as shown in Table 6.10.
Among 1,231 participants, 51% reported being male and 48% reported being female.
Also, 82% reported that their native language was English. With regard to age, most
participants were in the 25-34 age group (54%), followed by the 35-44 (27%) and
45-54 (12%) age groups. We found no statistically significant difference in any of our
metrics between participants of different age, gender, or native language.
6.6.2 Password Statistics
Table 6.11 shows the average, median, and standard deviation of the length of passwords
created under each approach. Passwords created with GuidedPass, GuidedPass-NE, and
zxcvbn were the longest, with averages of 13–13.9 characters. These were closely followed
by CMU-NE and zxcvbn-NE (11.9–12.2 characters), while NewNIST passwords were the
shortest (10.7 characters on average).
Approach | Avg. | Median | STD
GuidedPass | 13.5 | 13 | 3.0
GuidedPass-NE | 13 | 13 | 2.9
CMU-NE | 12.2 | 12 | 3.3
zxcvbn | 13.9 | 13 | 3.3
zxcvbn-NE | 11.9 | 11 | 4.0
NewNIST | 10.7 | 10 | 3.5
Table 6.11: Average, median and STD of password length
6.6.3 Recall after two days
We asked users to authenticate two days after password creation. Recall was successful
if the user correctly entered every character of the password. Table 6.12 shows the
overall recall performance. The second column shows the number of participants who
successfully authenticated under each password creation approach, out of the total
number of participants who returned for authentication. The third column shows the
recall success rate.
GuidedPass is memorable. GuidedPass and GuidedPass-NE were the top two
approaches based on recall. GuidedPass-NE had almost 13.5% higher recall, in relative
terms, than CMU-NE (81.08% vs. 71.43%), the most related competing approach. The main
difference between the two approaches lies in the specificity of the guidance provided
to users. We believe that the specificity of our feedback enabled users to create strong
passwords without sacrificing memorability.
Comparing the same approaches with and without target strength enforcement
(GuidedPass vs. GuidedPass-NE, zxcvbn vs. zxcvbn-NE), enforcement lowers recall by
10–21% in relative terms. Further, when we provide user guidance during password
creation (GuidedPass vs. GuidedPass-NE), enforcement costs less recall (10%) than when
no guidance is provided (zxcvbn vs. zxcvbn-NE loses 21%). Finally, approaches that offer
no guidance to users during password creation (zxcvbn, zxcvbn-NE and NewNIST) generally
have lower recall, by up to 31%, than approaches that provide guidance. We believe that
when guidance is lacking, users have to search longer for a suitable password, and they
explore more possible modifications, focusing on satisfying the password strength
requirement. This leads them to converge on less memorable passwords than when specific
guidance is provided.
Approach | Num. of Succ. Users / Total Users | Succ. Recall Percentage
GuidedPass | 109/150 | 72.67%
GuidedPass-NE | 120/148 | 81.08%
CMU-NE | 85/119 | 71.43%
zxcvbn | 79/142 | 55.63%
zxcvbn-NE | 90/127 | 70.87%
NewNIST | 109/162 | 67.28%
Table 6.12: Recall rate after 2 days.
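The percentages quoted in the discussion of Table 6.12 are relative differences between
recall rates, not differences in percentage points. The short Python sketch below
reproduces them from the counts in the table; the helper names are hypothetical and chosen
only for this illustration.

```python
# (successful recalls, returning participants) per approach, from Table 6.12.
RECALL = {
    "GuidedPass": (109, 150),
    "GuidedPass-NE": (120, 148),
    "CMU-NE": (85, 119),
    "zxcvbn": (79, 142),
    "zxcvbn-NE": (90, 127),
    "NewNIST": (109, 162),
}


def rate(name):
    """Recall success rate in percent."""
    successes, total = RECALL[name]
    return 100.0 * successes / total


def relative_gain(better, baseline):
    """How much higher `better` is than `baseline`, as a percentage of `baseline`."""
    return 100.0 * (rate(better) - rate(baseline)) / rate(baseline)


def relative_drop(reference, lower):
    """How much lower `lower` is than `reference`, as a percentage of `reference`."""
    return 100.0 * (rate(reference) - rate(lower)) / rate(reference)


if __name__ == "__main__":
    # GuidedPass-NE vs. CMU-NE: about 13.5% higher recall.
    print(round(relative_gain("GuidedPass-NE", "CMU-NE"), 1))
    # Cost of enforcement with guidance: about 10.4%.
    print(round(relative_drop("GuidedPass-NE", "GuidedPass"), 1))
    # Cost of enforcement without guidance: about 21.5%.
    print(round(relative_drop("zxcvbn-NE", "zxcvbn"), 1))
    # Largest gap between a guided and an unguided approach: about 31.4%.
    print(round(relative_drop("GuidedPass-NE", "zxcvbn"), 1))
```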
6.6.4 Creation Time
Table 6.13 shows the average time needed to create a password (the time between the
initial and the final password input by the user). The average time to create a password
was up to two times higher for suggestion-based approaches than for those that offer no
user guidance. This is expected, as users take time to read textual feedback. GuidedPass,
GuidedPass-NE and CMU-NE all had comparable password creation times of just under
2 minutes (105–111 seconds). Approaches that do not enforce a given target strength
had the lowest password creation time (zxcvbn-NE took 53 seconds, and NewNIST
took 62 seconds), while the approach that enforced a given target strength but did not
offer guidance to users (zxcvbn) took over 60% longer (88 seconds instead of 53
seconds).
Approach | Avg. time (sec)
GuidedPass | 110
GuidedPass-NE | 105
CMU-NE | 111
zxcvbn | 89
zxcvbn-NE | 53
NewNIST | 62
Table 6.13: Average password creation time
6.6.5 Strength
We evaluate the strength of each password collected in our study using the guess number
measure, which estimates the number of attempts that a statistical guessing attacker
would need to successfully guess the password [KKM+12]. One of the most popular
ways to calculate a guess number is a probabilistic approach based on emulating attacks
trained with a given dataset [KKM+12]. We use the approach by Dell’Amico and
Filippone [DF15], which estimates the number of guesses until success using Monte Carlo
sampling over a probabilistic password model. They have shown the accuracy of such
estimated strength against state-of-the-art attacks. We trained several password models:
the 2-gram and 3-gram models, and the Monte Carlo back-off model [DF15]. For
training the models, we used a total of 21 million leaked passwords from RockYou,
LinkedIn, MySpace, and eHarmony. The results are presented in the following section.
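The idea behind Monte Carlo guess-number estimation can be sketched as follows. The
sketch reflects our reading of the estimator in [DF15]: draw n passwords from the model,
and estimate the number of passwords more likely than the target as the sum of
1 / (n · p) over the sampled passwords whose probability p exceeds the target's. The toy
model below is an explicit probability table that stands in for the trained n-gram and
back-off models; its entries, the function names, and the default sample size are
assumptions made purely for illustration.

```python
import random


def sample_model(model, n, rng):
    """Draw n passwords from a model given as {password: probability}."""
    passwords = list(model)
    weights = [model[p] for p in passwords]
    return rng.choices(passwords, weights=weights, k=n)


def estimate_guess_number(model, target_prob, n=20000, seed=0):
    """Monte Carlo estimate of how many passwords are more likely than the target.

    Each sampled password with probability p > target_prob contributes 1 / (n * p),
    so that every more-likely password contributes 1 in expectation.
    """
    rng = random.Random(seed)
    sample = sample_model(model, n, rng)
    return sum(1.0 / (n * model[pw]) for pw in sample if model[pw] > target_prob)


# Hypothetical toy model; a real model assigns probabilities to all possible strings.
TOY_MODEL = {
    "123456": 0.40,
    "password": 0.30,
    "qwerty": 0.15,
    "letmein": 0.10,
    "Tr0ub4dor": 0.05,
}

if __name__ == "__main__":
    # Three passwords are more likely than "letmein", so the estimate is close to 3.
    print(estimate_guess_number(TOY_MODEL, TOY_MODEL["letmein"]))
```

The appeal of this approach is that only sampling and probability evaluation are needed,
so the guess number of a very strong password can be estimated without enumerating the
enormous number of guesses that precede it.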
GuidedPass is strong. We summarize the median strength in Table 6.14 and show
empirical CDFs in Figs. 6.5–6.8. The last column of the table shows the median strength
of passwords, where the median is calculated over the lowest guess number obtained for
each password. GuidedPass and zxcvbn, both with strength enforcement, produce the
strongest passwords in most measures, with zxcvbn being stronger than GuidedPass.
Further, GuidedPass-NE outperforms CMU-NE, requiring around 10 times more guesses.
It is interesting to note that without enforcement GuidedPass strength did not degrade
much (around 10 times), while zxcvbn strength degraded a lot (around 10,000 times).
Thus user guidance helped create strong passwords even without enforcement. Finally,
NewNIST performed very poorly, requiring in general around 100 times fewer guesses
than other approaches.
Approach | 2-gram | 3-gram | Back-off | Min. guess
GuidedPass | 3.43E+19 | 5.62E+18 | 5.18E+19 | 1.44E+18
GuidedPass-NE | 7.4E+18 | 5.04E+17 | 1.45E+18 | 4.12E+16
CMU-NE | 1.38E+18 | 5.55E+16 | 2.29E+17 | 8.12E+15
zxcvbn | 7.45E+20 | 2.55E+19 | 9.09E+19 | 4.10E+18
zxcvbn-NE | 3.44E+16 | 3.95E+15 | 1.74E+15 | 3.25E+14
NewNIST | 4.87E+14 | 8.26E+13 | 6.53E+13 | 5.95E+12
Table 6.14: Median guess number, measured using the 2-gram, 3-gram and Monte Carlo
back-off models. The last column shows the median number of guesses among the lowest
guess values produced by the three methods.
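The last column of Table 6.14 models an attacker who, for each password, uses whichever
model guesses it soonest. A minimal Python sketch of that computation follows; the guess
numbers in the example are made-up values for three hypothetical passwords, not data from
the study, and the function name is likewise an assumption.

```python
from statistics import median


def median_min_guess(per_model_guesses):
    """Median, over passwords, of the lowest guess number across all models.

    `per_model_guesses` maps a model name to a list of guess numbers, where
    position i in every list refers to the same password.
    """
    columns = list(per_model_guesses.values())
    per_password_min = [min(values) for values in zip(*columns)]
    return median(per_password_min)


# Hypothetical guess numbers for three passwords under the three models.
EXAMPLE = {
    "2-gram": [1e6, 1e12, 1e9],
    "3-gram": [1e5, 1e13, 1e8],
    "back-off": [1e7, 1e11, 1e10],
}

if __name__ == "__main__":
    print(median_min_guess(EXAMPLE))
```

Because the minimum is taken per password before the median, the resulting value can be
no larger than the median of any single model, which is consistent with the last column
of Table 6.14 being the smallest entry in every row.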
Figure 6.5: Password strength with 2-gram model
Figure 6.6: Password strength with 3-gram model
Figure 6.7: Password strength with back-off model
Figure 6.8: Optimal attacker chooses min from (2-gram, 3-gram, and back-off)
6.6.6 Pattern Analysis
In the GuidedPass approach, we present all the suggestions to the user. We wanted to
measure whether specific suggestions were adopted more frequently by users and how much
they improved initial password strength. We captured the time and content of every
keystroke users entered into the password box during the study, including backspaces.
Table 6.15 shows all the suggestions adopted by the users, broken down by category.
On average, a user adopted 4.12 suggestions. The most popular suggestions were those that
asked for information to be added, which made up 81% of adopted suggestions, while those
that asked for a structure change made up only 19%. Among “Add” suggestions, the most
popular were those asking to add digits (27%) and uncommon words (25%). Close to
20% of users attempted to change the structure of their passwords. Inserting digits or
symbols in the middle of an existing password and changing case were among the most
adopted structure-change suggestions (around 2% each).
Rules | Perc. (%)
Add chars | 2.77
Add digits | 27.46
Add symbols | 17.63
Add uncommon words | 24.94
Add words | 7.81
Total “Add” suggestions | 80.6
Flip case | 2.02
Insert chars | 1.01
Insert digits | 2.52
Insert symbols | 2.52
Insert uncommon words | 0.76
Insert words | 1.26
Break sequence | 0.25
Delete | 8.82
Replace word | 0.25
Total “Structure Change” suggestions | 19.4
Table 6.15: Overall suggestions employed by users
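To make the distinction between “Add” and structure-change rules in Table 6.15 concrete,
the following Python sketch labels the change between two successive password versions.
It is a hypothetical heuristic written for illustration only: the text does not describe
how adopted suggestions were actually coded from the keystroke logs, and the function
names and labels below are assumptions.

```python
def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def classify_change(before, after):
    """Label the edit that turns `before` into `after` with a coarse rule name."""
    if after == before:
        return "none"
    if after.lower() == before.lower():
        return "flip case"  # same characters, only letter case differs
    if len(after) > len(before):
        if after.startswith(before) or after.endswith(before):
            return "add"  # content appended or prepended to the old password
        if is_subsequence(before, after):
            return "insert"  # content placed in the middle of the old password
    if len(after) < len(before) and is_subsequence(after, before):
        return "delete"
    return "other"
```

For example, turning “defense” into “defense42” is labeled as an add, while turning it
into “def4ense” is labeled as an insert, which is a structure change in the terminology
of Table 6.15.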
Next, we measure the difference in strength, using guess number, between the initial
and the final password for each given user. The initial and final strength distributions
are shown in Fig. 6.9, where the x-axis is the log of the guess number. The overall
strength improvement from users’ initial input to their final passwords is about
10^7–10^10 times, as shown in Fig. 6.9. Hence, suggestions did significantly increase
user password strength.
Figure 6.9: Empirical CDF of strength improvement between the initial and the final
password.
6.6.7 User Sentiment
After each participant completed the authentication task, they were asked to rate their
agreement with the following statement on a Likert scale from 1 (strongly disagree)
to 10 (strongly agree), with 5 being neutral: the password creation was easy to use.
Table 6.16 shows the average and standard deviation of the Likert scale scores for
all approaches.
All approaches had ratings higher than 5, which means that users found them easy to
use. Approach zxcvbn-NE was the easiest to use with the highest rating. However, with
meter enforcement, zxcvbn had the worst rating with 6.30, since users were frustrated,
trying to exceed the target password strength without clear guidance on how to do this.
Our approaches (GuidedPass and GuidedPass-NE) had average ratings of 7.29 and 7.34,
respectively.
Approach | Avg. Likert Score | Std
GuidedPass | 7.29 | 2.52
GuidedPass-NE | 7.34 | 2.41
CMU-NE | 7.52 | 2.25
zxcvbn | 6.30 | 3.06
zxcvbn-NE | 7.59 | 2.69
NewNIST | 7.35 | 2.73
Table 6.16: Average ratings of “this approach was easy to use” claim on Likert 1–10
scale, with 10 being the strongest agreement and 1 being the strongest disagreement.
We also asked more specific questions about password suggestions of those participants
who were assigned to the CMU-NE and GuidedPass-NE approaches. We used the
same survey questions as Ur et al. [UAA+17]. We show the questions and the average
scores in Table 6.17. Questions 1–3 asked users to rate their agreement with the claim
in the question on a Likert 1–10 scale, with 10 being “Strongly Agree” and 1 being
“Strongly Disagree”. For Question 4, we coded a user’s “Yes” response as 1 and a
“No” response as 0. Questions 5–7 asked users to rate their agreement with the claim in
the question on a Likert 1–5 scale, with 5 being “Strongly Agree” and 1 being “Least
Agree”.
Users perceived very little difference between these two approaches. Overall, both were
found to be not very annoying, mildly fun, and not very difficult. Around 70% of users
followed suggestions under either approach. Users also found the suggestions mostly
informative and helpful for creating a strong password.
Survey Question | GuidedPass-NE | CMU-NE
Q.1. Creating a password under this approach was annoying? (1-10) | 3.82 | 4.14
Q.2. Creating a password under this approach was fun? (1-10) | 6.38 | 5.98
Q.3. Creating a password under this approach was difficult? (1-10) | 4.29 | 4.19
Q.4. Did you follow any of the suggestions? (0:No, 1:Yes) | 0.76 | 0.70
Q.5. Suggestions helped me create a strong password. (1-5) | 3.65 | 3.68
Q.6. Suggestions were informative. (1-5) | 3.74 | 3.86
Q.7. Suggestions helped me create a different password than I would have otherwise. (1-5) | 3.64 | 3.56
Table 6.17: Survey responses
6.7 Conclusions
In this chapter I described my work on designing and evaluating a semantically-driven
password suggestion system, called GuidedPass. GuidedPass passwords are 10^7 times
stronger than, and as memorable as, users’ initial passwords. Also, GuidedPass passwords
are 100 times stronger and 20% more memorable than passwords created with only
password-meter feedback without any strength enforcement. Our research demonstrates
that it is possible to evolve a user’s initial password into a stronger version of itself,
without losing memorability. Future work in this space would be to explore a probabilistic
approach for selecting which suggestions to present to users, instead of presenting all
of them. This would reduce user burden and password creation time, and it may improve
password diversity.
Chapter 7
Conclusion
Textual passwords are a widely used form of authentication and suffer from many
deficiencies because users trade security for memorability. Users create weak passwords
because they are memorable, and reuse the same password across many sites. Forcing
users to create strong passwords does not help, as these are easily forgotten.
I have proposed life-experience passwords (LEPs) as a new authentication mecha-
nism, which strikes a good balance between security and memorability. I investigated
several LEP designs and evaluated them in two user studies. My results show that LEPs
are much more memorable and secure than passwords, they are less often reused, and
they are strong against friend guessing. While they take more time to create and input
during authentication, I believe their benefits may make them a viable primary authen-
tication mechanism for high-security servers, or much better secondary authentication
mechanisms than security questions. Also, I have proposed MNPass to improve user
recall while maintaining the strength of generated passphrases. Mnemonics at passphrase
creation will further prove to be a valuable aid in reducing passphrase reuse. I believe
mnemonics are a promising technique in our quest to improve user authentication. My
last proposed work, GuidedPass, demonstrates the effectiveness of meaningful semantic
suggestions in creating strong and memorable passwords. This enables users to create
more memorable and stronger passwords than complex password policy enforcement does.
Bibliography
[AH11] Lee Averell and Andrew Heathcote. The form of the forgetting curve and
the fate of memories. Journal of Mathematical Psychology, 55(1):25–35,
2011.
[asi] List of most common surnames in asia. https://en.wikipedia.org/
wiki/List_of_most_common_surnames_in_Asia. Accessed: 2015-10-
14.
[BBC+15] Joseph Bonneau, Elie Bursztein, Ilan Caron, Rob Jackson, and Mike
Williamson. Secrets, lies, and account recovery: Lessons from the use
of personal knowledge questions at google. In Proceedings of the 24th
International Conference on World Wide Web, pages 141–150. Interna-
tional World Wide Web Conferences Steering Committee, 2015.
[BCVO12] Robert Biddle, Sonia Chiasson, and Paul C Van Oorschot. Graphical
passwords: Learning from the first twelve years. ACM Computing Sur-
veys (CSUR), 44(4):19, 2012.
[BHvS12] Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Sta-
jano. The Quest to Replace Passwords: A Framework for Comparative
Evaluation of Web Authentication Schemes. In 2012 IEEE Symposium
on Security and Privacy, May 2012.
[BHvS15] Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Sta-
jano. Passwords and the Evolution of Imperfect Authentication. Com-
munications of the ACM, July 2015.
[bio] Why haven’t biometrics replaced passwords
yet? http://www.digitaltrends.com/android/
can-biometrics-secure-our-digital-lives/.
[BJM10] Joseph Bonneau, Mike Just, and Greg Matthews. What’s in a name?
In Financial Cryptography and Data Security, pages 98–113. Springer,
2010.
[BKCD14] Jeremiah Blocki, Saranga Komanduri, Lorrie Cranor, and Anupam Datta.
Spaced Repetition and Mnemonics Enable Recall of Multiple Strong
Passwords. arXiv preprint arXiv:1410.1490, 2014.
[Blo16a] The GitHub Blog. GitHub Security Update:
Reused password attack. https://github.com/blog/
2190-github-security-update-reused-password-attack, 2016.
[Blo16b] The GitHub Blog. GitHub Security Update:
Reused password attack. https://github.com/blog/
2190-github-security-update-reused-password-attack, 2016.
[Bon12a] Joseph Bonneau. The science of guessing: analyzing an anonymized
corpus of 70 million passwords. In 2012 IEEE Symposium on Security
and Privacy, May 2012.
[Bon12b] Joseph Bonneau. The science of guessing: analyzing an anonymized
corpus of 70 million passwords. In 2012 IEEE Symposium on Security
and Privacy, May 2012.
[bqu] Famous Quotes at BrainyQuote. http://www.brainyquote.com/.
[Bra] Brain Authentication. http://brainauth.com/testdrive/.
[BRS87] N. M. Bradburn, L. J. Rips, and S. K. Shevell. Answering autobiograph-
ical questions: the impact of memory and inference on surveys. Science,
236(4798), 1987.
[BS12] Joseph Bonneau and Ekaterina Shutova. Linguistic Properties of Multi-
word Passphrases. In Financial Cryptography and Data Security, pages
1–12. Springer, 2012.
[BSR+12] Hristo Bojinov, Daniel Sanchez, Paul Reber, Dan Boneh, and Patrick
Lincoln. Neuroscience meets cryptography: designing crypto primitives
secure against rubber hose attacks. In Proceedings of the 21st USENIX
conference on Security symposium, pages 33–33. USENIX Association,
2012.
[BV15] Manuel Blum and Santosh Srinivas Vempala. Publishable humanly
usable secure password creation schemas. In Third AAAI Conference
on Human Computation and Crowdsourcing, 2015.
[BvO11] Kemal Bicakci and Paul C van Oorschot. A Multi-word Password Pro-
posal (gridWord) and Exploring Questions about Science in Security
Research and Usable Security Evaluation. In Proceedings of the 2011
New security paradigms workshop, pages 25–36. ACM, 2011.
[CBBT14] D. Coventry, P. Briggs, J. Blythe, and M. Tran. Using behavioural
insights to improve the public’s use of cyber security best practice. In
Government Office for Science, London, 2014.
[CDP12] Claude Castelluccia, Markus Dürmuth, and Daniele Perito. Adaptive
password-strength meters from markov models. In NDSS, 2012.
[cena] Frequently occurring surnames from census 1990 - names files.
http://www.census.gov/topics/population/genealogy/data/1990_
census/1990_census_namefiles.html. Accessed: 2015-10-14.
[cenb] Frequently occurring surnames from the census 2000. http:
//www.census.gov/topics/population/genealogy/data/2000_
surnames.html. Accessed: 2015-10-14.
[CFBvO08] Sonia Chiasson, Alain Forget, Robert Biddle, and Paul C van Oorschot.
Influencing Users Towards Better Passwords: Persuasive Cued Click-
points. In Proceedings of the 22nd British HCI Group Annual Confer-
ence on People and Computers: Culture, Creativity, Interaction-Volume
1, pages 121–130. British Computer Society, 2008.
[chia] Chinese girl names. http://www.top-100-baby-names-search.com/
chinese-girl-names.html.
[chib] Chinese male names. http://www.top-100-baby-names-search.com/
chinese-male-names.html.
[chic] List of cities in China. https://en.wikipedia.org/wiki/List_of_
cities_in_China.
[chid] List of common chinese surnames. https://en.wikipedia.org/wiki/
List_of_common_Chinese_surnames. Accessed: 2015-10-14.
[CHJPW13] Sadie Creese, Duncan Hodges, Sue Jamison-Powell, and Monica Whitty.
Relationships between password choices, perceptions of risk and security
expertise. In International Conference on Human Aspects of Information
Security, Privacy, and Trust, pages 80–89. Springer, 2013.
[CL16] Kate Conger and Matthew Lynley. Dropbox employee’s
password reuse led to theft of 60M+ user cre-
dentials. https://techcrunch.com/2016/08/30/
dropbox-employees-password-reuse-led-to-theft-of-60m-user-credentials/,
2016.
[CMS+13] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants,
and Phillipp Koehn. One Billion Word Benchmark for Measuring
Progress in Statistical Language Modeling. CoRR, abs/1312.3005, 2013.
[coc] The corpus of contemporary american english (coca). http://corpus.
byu.edu/coca/. Accessed: 2015-10-14.
[CP] Cognitive password. http://en.wikipedia.org/wiki/Cognitive_
password/.
[CvOB07] Sonia Chiasson, Paul C van Oorschot, and Robert Biddle. Graphical
Password Authentication Using Cued Click Points. In European Sympo-
sium on Research in Computer Security, pages 359–374. Springer, 2007.
[DBC+14a] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and
XiaoFeng Wang. The tangled web of password reuse. In NDSS, vol-
ume 14, pages 23–26, 2014.
[DBC+14b] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and
XiaoFeng Wang. The tangled web of password reuse. In NDSS, vol-
ume 14, pages 23–26, 2014.
[dbp] DBpedia. http://wiki.dbpedia.org/.
[DBvDJ11] Tamara Denning, Kevin Bowers, Marten van Dijk, and Ari Juels. Explor-
ing implicit memory for painless password recovery. In Proceedings of
the 2011 annual conference on Human factors in computing systems,
pages 2615–2618. ACM, 2011.
[DC07] Florencio D. and Herley C. A large-scale study of web password habits.
In Proceedings of the WWW, 2007.
[dCM14] Xavier de Carné de Carnavalet and Mohammad Mannan. From very
weak to very strong: Analyzing password-strength meters. In NDSS,
volume 14, pages 23–26, 2014.
[DF15] Matteo Dell’Amico and Maurizio Filippone. Monte Carlo Strength
Evaluation: Fast and Reliable Password Checking. In Proceedings of
the 22nd ACM SIGSAC Conference on Computer and Communications
Security, pages 158–169. ACM, 2015.
[DHH13] Sauvik Das, Eiji Hayashi, and Jason I Hong. Exploring capturable every-
day memory for autobiographical authentication. In Proceedings of the
2013 ACM international joint conference on Pervasive and ubiquitous
computing, pages 211–220. ACM, 2013.
[DHS] Sauvik Das, Jason Hong, and Stuart Schechter. Testing Computer-Aided
Mnemonics and Feedback for Fast Memorization of High-Value Secrets.
Proceedings of the 2016 Usable Security Workshop.
[DMR04a] Darren Davis, Fabian Monrose, and Michael K Reiter. On user choice
in graphical password schemes. In USENIX Security Symposium, vol-
ume 13, pages 11–11, 2004.
[DMR04b] Darren Davis, Fabian Monrose, and Michael K Reiter. On User Choice
in Graphical Password Schemes. In USENIX Security Symposium, vol-
ume 13, pages 11–11, 2004.
[Ebb13] Hermann Ebbinghaus. Memory: A contribution to experimental psychol-
ogy. Number 3. University Microfilms, 1913.
[eng] Best Places to Propose. http://www.brides.com/honeymoons/2014/
04/best-honeymoon-destinations#slide=2.
[ESM+13] Serge Egelman, Andreas Sotirakopoulos, Ildar Muslukhov, Konstantin
Beznosov, and Cormac Herley. Does my password go up to eleven?:
the impact of password meters on password selection. In Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems, pages
2379–2388. ACM, 2013.
[eur] List of the most common surnames in europe. https://en.wikipedia.
org/wiki/List_of_the_most_common_surnames_in_Europe.
Accessed: 2015-10-14.
[Fel98] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database.
MIT Press, Cambridge, Massachusetts, 1998.
[FH07] Dinei Florencio and Cormac Herley. A Large-scale study of Web Pass-
word Habits. In Proceedings of the 16th WWW conference, pages 657–
666. ACM, 2007.
[FH10] Dinei Florêncio and Cormac Herley. Where do security policies come
from? In Proceedings of the Sixth Symposium on Usable Privacy and
Security, page 10. ACM, 2010.
[FHVO14a] Dinei Florêncio, Cormac Herley, and Paul C Van Oorschot. An adminis-
trator’s guide to internet password research. In LISA, pages 35–52, 2014.
[FHVO14b] Dinei Florêncio, Cormac Herley, and Paul C Van Oorschot. Password
portfolios and the finite-effort user: Sustainably managing large numbers
of accounts. In 23rd USENIX Security Symposium, pages 575–590, 2014.
[FHVO16a] Dinei Florêncio, Cormac Herley, and Paul C Van Oorschot. Pushing on
string: The ’don’t care’ region of password strength. Communications of
the ACM, 59(11):66–74, 2016.
[FHVO16b] Dinei Florêncio, Cormac Herley, and Paul C Van Oorschot. Pushing on
string: The ’don’t care’ region of password strength. Communications of
the ACM, 59(11):66–74, 2016.
[FK79] W Nelson Francis and Henry Kucera. Brown corpus manual. Brown
University, 1979.
[FNw] List of most popular given names. https://en.wikipedia.org/wiki/
List_of_most_popular_given_names.
[fre] Freebase. http://www.freebase.com/.
[GFN
+
17] Paul A Grassi, James L Fenton, Elaine M Newton, Ray A Perlner,
Andrew R Regenscheid, William E Burr, Justin P Richer, Naomi B
Lefkovitz, Jamie M Danker, YeeYin Choong, et al. Draft nist special
publication 800 63b digital identity guidelines. 2017.
[GJ05] Virgil Grith and Markus Jakobsson. Messin’ with texas: Deriving
mother’s maiden names using public records. In Applied Cryptography
and Network Security, pages 91–103. Springer, 2005.
[goo] google-10000-english. https://github.com/first20hours/
google-10000-english/. Accessed: 2015-10-14.
[Goo16a] Dan Goodin. Then there were 117 mil-
lion. LinkedIn password breach much bigger than
thought. http://arstechnica.com/security/2016/05/
then-there-were-117-million-linkedin-password-breach-much-bigger-than-thought/,
2016.
139
[Goo16b] Dan Goodin. Why passwords have never been weaker, and crackers
have never been stronger. http://arstechnica.com/security/2012/
08/passwords-under-assault/, 2016.
[Gui06] NIST Electronic Authentication Guideline. Nist special publication 800-
63 version 1.0. 2, 2006.
[gut] Top 100 - Project Gutenberg. https://www.gutenberg.org/browse/
scores/top.
[HCM
+
17] Hana Habib, Jessica Colnago, William Melicher, Blase Ur, Sean Segreti,
Lujo Bauer, Nicolas Christin, and Lorrie Cranor. Password creation in
the presence of blacklists. 2017.
[HKF
+
13] Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan
Weese. UMBC EBIQUITY-CORE: Semantic textual similarity systems.
In Proceedings of the Second Joint Conference on Lexical and Compu-
tational Semantics, volume 1, pages 44–52, 2013.
[Hob] List of hobbies . https://en.wikipedia.org/wiki/List_of_hobbies.
[HOK
+
15] Jun Ho Huh, Seongyeol Oh, Hyoungshick Kim, Konstantin Beznosov,
Apurva Mohan, and S Raj Rajagopalan. Surpass: System-initiated user-
replaceable passwords. In Proceedings of the 22nd ACM SIGSAC Con-
ference on Computer and Communications Security, pages 170–181.
ACM, 2015.
[hon] The World’s Top 20 Honeymoon Destinations. http://www.brides.
com/honeymoons/2014/04/best-honeymoon-destinations#slide=2.
[imd] Top 100 Favorite Movie Quotes. http://www.imdb.com/.
[Inc] Google Inc. Facial recognition. US Patent number 8,457,367.
[inda] Indian name. https://en.wikipedia.org/wiki/Indian_name.
Accessed: 2015-10-14.
[indb] List of towns in India by population. https://en.wikipedia.org/wiki/
List_of_cities_and_towns_in_India_by_population.
[indc] Popular Indian Boy Names. http://babynames.extraprepare.com/
boy-popular.php.
[indd] Popular Indian Girl Names. http://babynames.extraprepare.com/
girl-popular.php.
140
[IS10] Philip G Inglesant and M Angela Sasse. The true cost of unusable pass-
word policies: password use in the wild. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, pages 383–392.
ACM, 2010.
[JA09] Mike Just and David Aspinall. Personal choice and challenge questions:
a security and usability assessment. In Proceedings of the 5th Symposium
on Usable Privacy and Security, page 8. ACM, 2009.
[JMM
+
99] Ian Jermyn, Alain Mayer, Fabian Monrose, Michael K Reiter, and
Aviel D Rubin. The design and analysis of graphical passwords. In
Proceedings of the 8th USENIX Security Symposium, pages 1–14. Wash-
ington DC, 1999.
[Jur00] Dan Jurafsky. Speech & language processing. Pearson Education India,
2000.
[JYW
+
15] Shouling Ji, Shukun Yang, Ting Wang, Changchang Liu, Wei-Han Lee,
and Raheem Beyah. Pars: A uniform and open-source password anal-
ysis and research system. In Proceedings of the 31st Annual Computer
Security Applications Conference, pages 321–330. ACM, 2015.
[KKM
+
12] Patrick Gage Kelley, Saranga Komanduri, Michelle L Mazurek, Richard
Shay, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lorrie Faith Cra-
nor, and Julio Lopez. Guess again (and again and again): Measuring
password strength by simulating password-cracking algorithms. In Secu-
rity and Privacy (SP), 2012 IEEE Symposium on, pages 523–537. IEEE,
2012.
[Kor] KoreLogic Security. KoreLogic’s Custom rules - DEFCON 2010. http:
//contest-2010.korelogic.com/rules.html.
[KRC] Cynthia Kuo, Sasha Romanosky, and Lorrie Faith Cranor. Human Selec-
tion of Mnemonic Phrase-based Passwords. In Proceedings of the 2006
Symposium on Usable Privacy and Security, pages 67–78.
[KSC
+
] Saranga Komanduri, Richard Shay, Lorrie Faith Cranor, Cormac Herley,
and Stuart Schechter. Telepathwords: Preventing weak passwords by
reading users’ minds.
[KSC
+
14] Saranga Komanduri, Richard Shay, Lorrie Faith Cranor, Cormac Herley,
and Stuart E Schechter. Telepathwords: Preventing weak passwords by
reading users’ minds. In USENIX Security, pages 591–606, 2014.
141
[KSK
+
11] Saranga Komanduri, Richard Shay, Patrick Gage Kelley, Michelle L
Mazurek, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Serge
Egelman. Of passwords and people: measuring the eect of password-
composition policies. In Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, pages 2595–2604. ACM, 2011.
[KSS07] Mark Keith, Benjamin Shao, and Paul John Steinbart. The Usability of
Passphrases for Authentication: An Empirical Field Study. International
journal of human-computer studies, 65(1):17–28, 2007.
[KSS09] Mark Keith, Benjamin Shao, and Paul Steinbart. A Behavioral Analysis
of Passphrase Design and Eectiveness. Journal of the Association for
Information Systems, 10(2):2, 2009.
[lov] Love Poems And Quotes. http://www.lovepoemsandquotes.com/.
[Mag] Forbes Magazine. The World’s Most Valuable Sports Teams.
[Mas] Luke Mastin. The human memory. http://www.human-memory.net/.
[Mas94] James L Massey. Guessing and entropy. In Information Theory, 1994.
Proceedings., 1994 IEEE International Symposium on, page 204. IEEE,
1994.
[MGua] Mnemonic Guard. http://www.mneme.co.jp/english/index.html.
[MGub] Mnemonic Guard Blog. http://mnemonicguard.blogspot.com/.
[mpa] Microsoft corporation, sketch-based password authentication. US Patent
number 8,024,775.
[mtu] Amazon mechanical turk. https://www.mturk.com/.
[NCD05] Ann Nosseir, Richard Connor, and MD Dunlop. Internet authentication
based on personal history – a feasibility test. In Proceedings of Customer
Focused Mobile Services Workshop, 2005.
[Nie03] A. Niedzwienska. Distortion of autobiographical memories. Applied
Cognitive Psychology, 17(1):81–91, 2003.
[nor] List of most common surnames in north america. https:
//en.wikipedia.org/wiki/List_of_most_common_surnames_in_
North_America. Accessed: 2015-10-14.
142
[npd] U.S. Total Restaurant Count Increases by 4,442 Units over Last Year,
Reports NPD. http://www.nytimes.com/interactive/2015/01/11/
travel/52-places-to-go-in-2015.html?_r=0.
[nps] List of national parks of the United States. https://en.wikipedia.org/
wiki/List_of_national_parks_of_the_United_States.
[nyt] 52 Places to Go in 2015. http://www.nytimes.com/interactive/
2015/01/11/travel/52-places-to-go-in-2015.html?_r=0.
[oce] List of most common surnames in oceania. https://en.wikipedia.
org/wiki/List_of_most_common_surnames_in_Oceania. Accessed:
2015-10-14.
[PAM] Pluggable authentication modules for linux (pam). http://www.
linux-pam.org/. Accessed: 2015-10-14.
[par] Party. https://en.wikipedia.org/wiki/Party.
[pbs] How Common is Your Last Name? http://www.pbs.org/pov/
thesweetestsound/popindex.php.
[Phi16] Suzanne Philion. An Important Message About Yahoo User Secu-
rity. https://investor.yahoo.net/releasedetail.cfm?ReleaseID=
990570, 2016.
[Pli] John O Pliam. On the Incomparability of Entropy and Marginal Guess-
work in Brute-force Attacks. In Progress in Cryptology, INDOCRYPT
2000, pages 67–79. Springer.
[PLS] National Center for Education Statistics, How many educational institu-
tions exist in the United States? . https://nces.ed.gov/fastfacts/
display.asp?id=84.
[PLUa] Number of U.S. Colleges and Universities and Degrees Awarded, 2005.
http://www.infoplease.com/ipa/A0908742.html.
[PLUb] World List of Universities, 25th Edition: And Other Institu-
tions of Higher Education (World List of Universities & Other
Institutions of Higher Education). http://www.amazon.com/
World-List-Universities-25th-Edition/dp/1403992525.
[poe] Short Poems. https://www.shortpoems.org/.
[por] The Porter Stemming Algorithm. http://tartarus.org/martin/
PorterStemmer/.
143
[Pyt16] Python SequenceMatcher Objects. https://docs.python.org/2.4/
lib/sequence-matcher.html, 2016.
[res] Top 100 Chains: U.S. Sales. http://nrn.com/us-top-100/
top-100-chains-us-sales.
[RJK13a] Ashwini Rao, Birendra Jha, and Gananand Kini. Eect of Grammar on
Security of Long Passwords. In Proceedings of the third ACM conference
on Data and application security and privacy, pages 317–324, 2013.
[RJK13b] Ashwini Rao, Birendra Jha, and Gananand Kini. Eect of grammar on
security of long passwords. In Proceedings of the third ACM conference
on Data and application security and privacy, pages 317–324. ACM,
2013.
[RMM16] Elissa M Redmiles, Amelia Malone, and Michelle L Mazurek. I Think
They’re Trying To Tell Me Something: Advice Sources and Selection
for Digital Security. In 2016 IEEE Symposium on Security and Privacy,
May 2016.
[Rob] Gabriel Robins. Good Quotations by Famous People. http://www.cs.
virginia.edu/
~
robins/quotes.html.
[SB04] Wayne C Summers and Edward Bosworth. Password policy: the good,
the bad, and the ugly. In Proceedings of the winter international sympo-
sium on Information and communication technologies, pages 1–6. Trin-
ity College Dublin, 2004.
[SB14] Elizabeth Stobert and Robert Biddle. The password life cycle: user
behaviour in managing passwords. In Symposium On Usable Privacy
and Security (SOUPS 2014), pages 243–255, 2014.
[SBE09] Stuart Schechter, AJ Bernheim Brush, and Serge Egelman. It’s no secret.
measuring the security and reliability of authentication via ’secret’ ques-
tions. In Security and Privacy, 2009 30th IEEE Symposium on, pages
375–390. IEEE, 2009.
[SG94] Yishay Spector and Jacob Ginzberg. Pass-sentence – a New Approach to
Computer Code. Computers & Security, 13(2):145–160, 1994.
[Sha01] Claude Elwood Shannon. A mathematical theory of communication.
ACM SIGMOBILE Mobile Computing and Communications Review,
5(1):3–55, 2001.
144
[SK13] Peter Snyder and Chris Kanich. Cloudsweeper: enabling data-centric
document management for secure cloud archives. In Proceedings of the
2013 ACM workshop on Cloud computing security workshop, pages 47–
54. ACM, 2013.
[SKD
+
14] Richard Shay, Saranga Komanduri, Adam L Durity, Phillip Seyoung
Huh, Michelle L Mazurek, Sean M Segreti, Blase Ur, Lujo Bauer, Nico-
las Christin, and Lorrie Faith Cranor. Can long passwords be secure and
usable? In Proceedings of the 32nd annual ACM conference on Human
factors in computing systems, pages 2927–2936. ACM, 2014.
[SKD
+
16] Richard Shay, Saranga Komanduri, Adam L Durity, Phillip Seyoung
Huh, Michelle L Mazurek, Sean M Segreti, Blase Ur, Lujo Bauer, Nico-
las Christin, and Lorrie Faith Cranor. Designing password policies for
strength and usability. ACM Transactions on Information and System
Security (TISSEC), 18(4):13, 2016.
[SKK
+
] Richard Shay, Patrick Gage Kelley, Saranga Komanduri, Michelle L
Mazurek, Blase Ur, Timothy Vidas, Lujo Bauer, Nicolas Christin, and
Lorrie Faith Cranor. Correct Horse Battery Staple: Exploring the Usabil-
ity of System-assigned Passphrases. In Proceedings of the 2012 Sympo-
sium on Usable Privacy and Security, page 7.
[SKK
+
10] Richard Shay, Saranga Komanduri, Patrick Gage Kelley, Pedro Gio-
vanni Leon, Michelle L Mazurek, Lujo Bauer, Nicolas Christin, and Lor-
rie Faith Cranor. Encountering stronger password requirements: user
attitudes and behaviors. In Proceedings of the Sixth Symposium on
Usable Privacy and Security, page 2. ACM, 2010.
[SKK
+
12] Richard Shay, Patrick Gage Kelley, Saranga Komanduri, Michelle L
Mazurek, Blase Ur, Timothy Vidas, Lujo Bauer, Nicolas Christin, and
Lorrie Faith Cranor. Correct horse battery staple: Exploring the usability
of system-assigned passphrases. In Proceedings of the eighth symposium
on usable privacy and security, page 7. ACM, 2012.
[SMB13] Anil Somayaji, David Mould, and Carson Brown. Towards narrative
authentication: or, against boring authentication. In Proceedings of
the 2013 workshop on New security paradigms workshop, pages 57–64.
ACM, 2013.
[sou] List of most common surnames in south america. https:
//en.wikipedia.org/wiki/List_of_most_common_surnames_in_
South_America. Accessed: 2015-10-14.
145
[ssa] John the ripper password cracker. https://www.ssa.gov/OACT/
babynames/limits.html. Accessed: 2015-10-14.
[topa] Occupational Employment Statistics . http://www.bls.gov/oes/
current/oes_nat.htm.
[topb] 10 Tried-And-True Wedding Flowers. https://www.theknot.com/
content/top-10-wedding-flowers.
[topc] 100 Greatest Actors of All Time. http://www.imdb.com/list/
ls000034841/.
[topd] 50 College Graduation Gift Ideas. https://
www.universityparent.com/topics/parent-posts/
50-college-graduation-gift-ideas-for-parents-2/.
[tope] Academic Ranking of World Universities . https://en.wikipedia.
org/wiki/Academic_Ranking_of_World_Universities.
[topf] Highest Rated TV Series With At Least 5,000 V otes . http://www.imdb.
com/chart/top.
[topg] List of best-selling books. https://en.wikipedia.org/wiki/List_of_
best-selling_books.
[toph] List of best-selling fiction authors. https://en.wikipedia.org/wiki/
List_of_best-selling_fiction_authors/.
[topi] List of best-selling music artists. https://en.wikipedia.org/wiki/
List_of_best-selling_music_artists.
[topj] List of largest hotels in the world. https://en.wikipedia.org/wiki/
List_of_largest_hotels_in_the_world.
[topk] List of most commonly learned foreign languages in the United
States. https://en.wikipedia.org/wiki/List_of_most_commonly_
learned_foreign_languages_in_the_United_States.
[topl] List of sports attendance figures. https://en.wikipedia.org/wiki/
List_of_sports_attendance_figures/.
[topm] Sports in the United States. https://en.wikipedia.org/wiki/Sports_
in_the_United_States.
146
[topn] Top 10 Colors for Bridesmaid Dresses.
http://www.tulleandchantilly.com/blog/
top-10-colors-for-bridesmaid-dresses/.
[topo] Top 25 Most Popular Sports/Recreational Activities in the U.S. . https:
//www.sfia.org/.
[topp] Top Rated Movies . http://www.imdb.com/search/title?num_votes=
5000,&sort=user_rating,desc&title_type=tv_series.
[topq] Top Ten Most Requested Cake Combos. http://nymag.com/shopping/
guides/weddings/planner/features/topten_cakes.htm.
[topr] Your pick: World’s 50 best foods .
http://travel.cnn.com/explorations/eat/
readers-choice-worlds-50-most-delicious-foods-012321.
[tri] Tripadviser. http://www.tripadvisor.com/.
[UAA
+
17] Blase Ur, Felicia Alfieri, Maung Aung, Lujo Bauer, Nicolas Christin,
Jessica Colnago, Lorrie Cranor, Harold Dixon, Pardis Emami Naeini,
Hana Habib, Noah Johnson, and William Melicher. Design and evalu-
ation of a data-driven password meter. In CHI’17: 35th Annual ACM
Conference on Human Factors in Computing Systems, May 2017. To
appear. ¡b¿¡i¿CHI Best Paper¡/i¿¡/b¿.
[UBS
+
16] Blase Ur, Jonathan Bees, Sean Segreti, Lujo Bauer, Nicolas Christin, and
Lorrie Faith Cranor. Do users’ perceptions of password security match
reality? In CHI’16: 34th Annual ACM Conference on Human Factors in
Computing Systems. ACM, May 2016.
[UCR16] UCREL CLAWS7 Tagset. http://ucrel.lancs.ac.uk/claws7tags.
html, 2016.
[UKK
+
12] Blase Ur, Patrick Gage Kelley, Saranga Komanduri, Joel Lee, Michael
Maass, Michelle L Mazurek, Timothy Passaro, Richard Shay, Timothy
Vidas, Lujo Bauer, et al. How does your password measure up? the eect
of strength meters on password creation. In USENIX Security Sympo-
sium, pages 65–80, 2012.
[UNB
+
15] Blase Ur, Fumiko Noma, Jonathan Bees, Sean M. Segreti, Richard Shay,
Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. ”I added ‘!’ at
the end to make it secure”: Observing password creation in the lab. In
SOUPS ’15: Proceedings of the 11th Symposium on Usable Privacy and
Security. USENIX, July 2015.
147
[USC] List of United States cities by population. https://en.wikipedia.org/
wiki/List_of_United_States_cities_by_population.
[VCT14a] Rafael Veras, Christopher Collins, and Julie Thorpe. On semantic pat-
terns of passwords and their security impact. In NDSS, 2014.
[VCT14b] Rafael Veras, Christopher Collins, and Julie Thorpe. On the semantic
patterns of passwords and their security impact. In Network and Dis-
tributed System Security Symposium (NDSS’14), 2014.
[VZDLH13] Emanuel V on Zezschwitz, Alexander De Luca, and Heinrich Hussmann.
Survival of the shortest: A retrospective analysis of influencing factors on
password composition. In IFIP Conference on Human-Computer Inter-
action, pages 460–467. Springer, 2013.
[W3S] W3Schools. onbeforeunload Event. http://www.w3schools.com/
jsref/event_onbeforeunload.asp.
[WADMG09] Matt Weir, Sudhir Aggarwal, Breno De Medeiros, and Bill Glodek. Pass-
word cracking using probabilistic context-free grammars. In Security and
Privacy, 2009 30th IEEE Symposium on, pages 391–405. IEEE, 2009.
[Wei] Matt Weir. Quotes to use in pass-phrase cracking. https://sites.
google.com/site/reusablesec/Home/custom-wordlists.
[Whe16] Dan Lowe Wheeler. zxcvbn: Low-budget password strength estimation.
In Proc. USENIX Security, 2016.
[wik04] Wikipedia, the free encyclopedia, 2004. [Online; accessed 12-Feb-
2016].
[wor] List of cities proper by population. https://en.wikipedia.org/wiki/
List_of_cities_proper_by_population.
[WRBW16] Rick Wash, Emilee Rader, Ruthie Berman, and Zac Wellmer. Under-
standing password choices: How frequently entered passwords are re-
used across websites. In Symposium on Usable Privacy and Security
(SOUPS), 2016.
[YHIN08] Takumi Yamamoto, Atsushi Harada, Takeo Isarida, and Masakatsu
Nishigaki. Improvement of user authentication using schema of visual
memory: Exploitation of ”schema of story”. In Advanced Informa-
tion Networking and Applications, 2008. AINA 2008. 22nd International
Conference on, pages 40–47. IEEE, 2008.
148
[ZH93] Moshe Zviran and William J. Haga. A Comparison of Password Tech-
niques for Multilevel Authentication Mechanisms. The Computer Jour-
nal, 36(3):227–237, 1993.
149
Abstract
Textual passwords are widely used for user authentication, but they are often difficult for a user to recall, easily cracked by automated programs, and heavily reused. Weak or reused passwords are responsible for many contemporary security breaches. Hence, it is critical to study both how users choose and reuse passwords, and the reasons that they adopt unsafe practices.
In this thesis I first examine the reasons why people create weak passwords and reuse them over multiple accounts. My research complements the body of existing work by studying the semantic structure, strength and reuse of real passwords, as well as conscious and unconscious causes of unsafe practices. To do this, I used a test group population of 50 participants. Significant reuse and weak passwords clearly demonstrate the need for alternative authentication methods that are more memorable, secure, and less reused.
My next three key thesis topics focus on developing novel authentication mechanisms that can directly improve current approaches. The first topic, life-experience passwords (LEPs), uses a person’s prior life experience as information to generate more memorable and secure authentication questions. We show that LEPs significantly raise the level of memorability and security compared to existing passwords and security questions. My second topic constructs more memorable and more secure passphrases through the novel use of mnemonics: multi-letter abbreviations of passphrases (MNPass) composed of the first letters of each word in a passphrase. I apply mnemonics when generating and authenticating passphrases and show that the mnemonics-based approach improves recall compared to randomly generated passphrases, and enhances strength compared to user-selected passphrases. My third topic explores password creation with semantic feedback (GuidedPass). I analyze user-input passwords and provide real-time, specific suggestions for improvement based on their existing semantic structure. A GuidedPass password is 10⁷ times stronger than a user’s initial passwords and has good recall. GuidedPass passwords are also 100 times stronger and have a 20% higher recall than passwords created with only password-meter feedback.
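The mnemonic idea described for MNPass (taking the first letters of each word in a passphrase) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `mnemonic`, the `letters_per_word` parameter, and the lowercase normalization are all assumptions made for illustration, and the abstract does not say how many letters per word MNPass actually uses.

```python
def mnemonic(passphrase: str, letters_per_word: int = 1) -> str:
    """Abbreviate a passphrase by keeping the first letters of each word.

    Hypothetical helper: `letters_per_word` and the lowercasing step are
    illustrative assumptions, not details taken from the thesis.
    """
    words = passphrase.split()  # split on any run of whitespace
    return "".join(word[:letters_per_word].lower() for word in words)


if __name__ == "__main__":
    phrase = "correct horse battery staple"
    print(mnemonic(phrase))     # chbs
    print(mnemonic(phrase, 2))  # cohobast
```

A short abbreviation like this could serve as a recall cue for the full passphrase; how the thesis presents or verifies the mnemonic during authentication is not specified in the abstract.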
Asset Metadata
Creator
Woo, Simon S. (author)
Core Title
Memorable, secure, and usable authentication secrets
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
07/18/2017
Defense Date
07/14/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
OAI-PMH Harvest, passphrase, password, password meter, security questions, usable security
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Mirkovic, Jelena (committee chair), Artstein, Ron (committee member), Kaiser, Elsi (committee member), Knight, Kevin (committee member)
Creator Email
simonwoo@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-401883
Unique identifier
UC11265756
Identifier
etd-WooSimonS-5533.pdf (filename), usctheses-c40-401883 (legacy record id)
Legacy Identifier
etd-WooSimonS-5533.pdf
Dmrecord
401883
Document Type
Dissertation
Rights
Woo, Simon S.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA