Reducing Bias in Modeling Real-world Password Strength via Deep Learning and Dynamic Dictionaries

Dario Pasquini†,§, Marco Cianfriglia§, Giuseppe Ateniese‡ and Massimo Bernaschi§
† Sapienza University of Rome, ‡ Stevens Institute of Technology, § Institute of Applied Computing CNR

arXiv:2010.12269v3 [cs.CR] 12 Dec 2020

Abstract

Password security hinges on an accurate understanding of the techniques adopted by attackers. Unfortunately, real-world adversaries resort to pragmatic guessing strategies, such as dictionary attacks, that are inherently difficult to model in password security studies. To be representative of the actual threat, dictionary attacks must be thoughtfully configured and tuned. However, this process requires domain knowledge and expertise that cannot be easily replicated by researchers and security practitioners. The consequence of inaccurately calibrating those attacks is the unreliability of password security analyses, impaired by a severe measurement bias.

In the present work, we introduce new guessing techniques that make dictionary attacks consistently more resilient to inadequate configurations. Our framework allows dictionary attacks to self-heal and converge towards optimal performance, requiring no supervision or domain knowledge. To achieve this: (1) we use a deep neural network to model and then simulate the proficiency of expert adversaries; (2) we introduce dynamic guessing strategies within dictionary attacks, which mimic experts' ability to adapt their guessing strategies on the fly by incorporating knowledge about their targets.

Our techniques enable more robust and sound password-strength estimates within dictionary attacks, eventually reducing bias in modeling real-world threats in password security.

1 Introduction

Passwords have proven to be irreplaceable. They are still preferred over safer options and appear essential in fallback mechanisms. However, users tend to select their passwords as easy-to-remember strings, which results in very skewed distributions that can be easily modeled by an attacker. This makes passwords, and the authentication systems that implement them, inherently susceptible to guessing attacks. In this scenario, the security of the authentication protocol cannot be stated via a security parameter (e.g., the key size). The only way to establish the soundness of a system is to learn and model attackers' capabilities. To this end, simulating password guessing attacks has become a requisite practice. (1) Administrators rely on cracking sessions to reactively evaluate the security of their accounts. (2) Researchers use password guessing techniques to validate the soundness of proactive password checking approaches [34, 46]. Ultimately, modeling attackers' capabilities is critical to ensure the security of passwords.

In this direction, more than three decades of active research have provided us with powerful password models [34, 36, 37, 45]. However, very little progress has been made to systematically model real-world attackers [32, 43]. Indeed, professional password crackers rarely harness the fully-automated approaches developed in academia. They rely on more pragmatic guessing techniques that present stronger inductive biases. In offline attacks, professionals use high-throughput and flexible techniques such as dictionary attacks with mangling rules [1]. Moreover, they rely on highly tuned setups that result from profound expertise refined over years of practical experience [32, 43]. However, reproducing or modeling these proprietary attack strategies is very difficult, and the end results rarely mimic actual real-world threats [43]. This failure often results in an overestimation of password security that sways studies' conclusions and further jeopardizes password-based systems.

In the present work, we develop a new generation of dictionary attacks that more closely resembles real-world attackers' abilities and guessing strategies. In the process, we devise two complementary techniques that aim to systematically mimic different attackers' behaviors:

By rethinking the underlying framework, we devise the Adaptive Mangling Rules attack. This artificially simulates the optimal configurations harnessed by expert adversaries by explicitly handling the conditional nature of mangling rules. Here, during the attack, each word from the dictionary is associated with a dedicated and possibly unique rules-set that is created at runtime via a deep neural network. Using
this technique, we confirmed that standard attacks, based on off-the-shelf dictionaries and rules-sets, are sub-optimal and can be easily compressed up to an order of magnitude in the number of guesses. Furthermore, we are the first to explicitly model the strong relationship that binds mangling rules and dictionary words, demonstrating its connection with optimal configurations.

Our second contribution introduces dynamic guessing strategies within dictionary attacks [37]. Real-world adversaries perform their guessing attacks by incorporating prior knowledge of the targets and dynamically adjusting their guesses during the attack. In doing so, professionals seek to optimize their configurations and maximize the number of compromised passwords. Unfortunately, automatic guessing techniques fail to model this adversarial behavior. Instead, we demonstrate that dynamic guessing strategies can be enabled in dictionary attacks and substantially improve the guessing attack's effectiveness while requiring no prior optimization. More prominently, our technique makes dictionary attacks consistently more resilient to misconfigurations by promoting the completeness of the dictionary at runtime.

Finally, we combine these methodologies and introduce the Adaptive Dynamic Mangling rules attack (AdaMs). We show that it automatically causes the guessing strategy to progress towards an optimal one, regardless of the initial attack setup. The AdaMs attack consistently reduces the overestimation induced by inexpert configurations in dictionary attacks, enabling more robust and sound password strength estimates.

Organization: Section 2 gives an overview of the fundamental concepts needed for the comprehension of our contributions. In Section 3, we introduce the Adaptive Mangling Rules alongside the intuitions and tools on which they are based. Section 4 discusses dynamic mangling rules attacks. Finally, Section 5 aggregates the previous methodologies, introducing the AdaMs attack. The motivation and evaluation of the proposed techniques are presented in their respective sections. Section 6 concludes the paper; supplementary information is provided in the Appendices.

2 Background and preliminaries

In this Section, we start by covering password guessing attacks and their foundations in Section 2.1. In Section 2.2, we focus on dictionary attacks, which are the basis of our contributions. Next, Section 2.3 briefly discusses relevant related works. Finally, we define the threat model in Section 2.4.

2.1 Password Guessing

Human-chosen passwords do not distribute uniformly in the exponentially large key-space. Users tend to choose easy-to-remember passwords that aggregate in relatively few dense clusters. Real-world passwords, therefore, tend to cluster in very bounded distributions that can be modeled by an attacker, making authentication systems intrinsically susceptible to guessing attacks. In a guessing attack, the attacker aims at recovering plaintext credentials by attempting several candidate passwords (guesses) until success or budget exhaustion; this happens by either searching for collisions of password hashes (offline attack) or attempting remote logins (online attack). In this process, the attacker relies on a so-called password model that defines which guesses should be tried, and in which order, to maximize the effectiveness of the attack (see Section 2.4).

Generally speaking, a password model can be understood as a suitable estimation of the password distribution that enables an educated exploration of the key-space. Existing password models are built on a heterogeneous set of assumptions and rely on either intuitive or rigorous security definitions. From the most practical point of view, those can be divided into two macro-classes, i.e., parametric and nonparametric password models.

Parametric approaches build on top of probabilistic reasoning; they assume that real-world password distributions are sufficiently smooth to be accurately described by suitable parametric probabilistic models. Here, a password mass function is explicitly [34, 36] or implicitly [37] derived from a set of observable data (i.e., previously leaked passwords) and used to assign a probability to each element of the key-space. During the guessing attack, guesses are produced by traversing the key-space following the decreasing probability order imposed by the modeled mass function. These approaches are, in general, relatively slow and unsuitable for practical offline attacks. Although simple models such as Markov Chains can be employed [9], more advanced and effective models such as the neural network ones [34, 37] are hardly considered outside the research domain due to their inefficiency.

Nonparametric models such as Probabilistic Context-Free Grammars (PCFG) and dictionary attacks rely on simpler and more intuitive constructions, which tend to be closer to human logic. Generally, those treat passwords as realizations of templates and generate novel guesses by abstracting such patterns and applying them to ground-truth. These approaches maintain a collection of tokens that are either directly given as part of the model configuration (e.g., the dictionary and rules-set for a dictionary attack) or extracted from observed passwords in a setup phase (e.g., terminals/grammar for PCFG). In contrast with parametric models, these can produce only a limited number of guesses, which is a function of the chosen configuration. A detailed discussion of dictionary attacks follows in the next Section.
2.2 Dictionary Attacks

Dictionary attacks can be traced back to the inception of password security studies [35, 41]. They stem from the observation that users tend to pick their passwords from a bounded and predictable pool of candidates; common natural words and numeric patterns dominate most of this skewed distribution [40]. An attacker, collecting such strings (i.e., creating a dictionary/wordlist), can use them as high-quality guesses during a guessing attack, rapidly covering the key-space's densest zone. These dictionaries are typically constructed by aggregating passwords revealed in previous incidents and plain-word dictionaries.

Although dictionary attacks can produce only a limited number of guesses,¹ these can be extended through mangling rules. Mangling rules attacks describe password distributions by factorizing guesses into two main components: (1) dictionary-words and (2) string transformations (mangling rules). These transformations aim at replicating users' composition behavior, such as leeting or concatenating digits (e.g., "pa$$w0rd" or "password123") [26]. Mangling transformations are modeled by the attacker and collected in sets (i.e., rules-sets). During the guessing attack, each dictionary word is extended in real-time through mangling rules, creating novel guesses that augment the guessing attack's coverage over the key-space. Hereafter, we use the terms dictionary attack and mangling rules attack interchangeably.

¹ The required disk space inherently bounds the number of guesses issued by plain dictionary attacks. Guessing attacks can easily go beyond 10¹² guesses, and storing such a quantity of strings is not practical.

The most widely known implementations of mangling rules are included in the password cracking software Hashcat [6] and John the Ripper [8] (JtR). Here, mangling rules are encoded through simple custom programming languages. Hashcat and JtR share almost overlapping mangling-rules languages, although a few peculiar instructions are unique to each tool. However, they consistently differ in the way mangling rules are applied during the attack. Hashcat follows a word-major order, where all the rules of the rules-set are applied to a single dictionary-word before the next dictionary-word is considered. In contrast, JtR follows a rule-major order, where a rule is applied to all the dictionary words before moving to the next rule. In our work, we rely on the approach of Hashcat, as the word-major order is necessary to efficiently implement the adaptive mangling rules attack that we introduce in Section 3.3.
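For clarity, the two loop orders can be sketched as follows (an illustration with assumed names, not Hashcat/JtR code; rules are modeled as callables over strings):

    # Word-major order (Hashcat): exhaust all rules on one word first.
    def word_major(D, R):
        for w in D:
            for r in R:
                yield r(w)

    # Rule-major order (JtR): apply one rule to the whole dictionary first.
    def rule_major(D, R):
        for r in R:
            for w in D:
                yield r(w)

Both generators enumerate the same guesses; only the order differs, and the word-major order is what allows a per-word rules-set to be computed once per word.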
The community behind these software packages has developed numerous mangling-rules sets that are publicly available.

Despite their simplicity, mangling rules attacks represent a substantial threat in offline password guessing. Mangling rules are extremely fast and inherently parallel; they are naturally suited for both parallel hardware (i.e., GPUs) and distributed setups, making them one of the few guessing approaches suitable for large-scale attacks (e.g., botnets).

Furthermore, real-world attackers update their guessing strategy dynamically during the attack [43]. Based on prior knowledge and the initially matched passwords, they tune their guess-generation process to better describe their target set of passwords and eventually recover more of them. To this end, professionals prefer extremely flexible tools that allow for fast and complete customization. While state-of-the-art probabilistic models fail at that, mangling rules make any form of customization feasible as well as natural.

2.3 Related Works

Although dictionary attacks are ubiquitous in password security research [20, 23, 24, 30, 34], little effort has been spent studying them. This Section covers the most relevant contributions.

Ur et al. [43] first made explicit the large performance gap between optimized and stock configurations for mangling rules attacks. In their work, Ur et al. recruited professional figures in password recovery and compared their performance against off-the-shelf parametric/nonparametric approaches in different guessing scenarios. Here, professional attackers were shown capable of vastly outperforming any password model, thanks to custom dictionaries, proprietary mangling rules, and the ability to create tailored rules for the attacked set of passwords (referred to as freestyle rules). Finally, the authors show that the performance gap between professional and non-professional attacks can be reduced by combining the guesses of multiple password models.

More recently, Liu et al. [32] produced a set of tools that can be used to optimize the configuration of dictionary attacks. These solutions extend previous approaches [3, 7], making them faster. Their core contribution is an algorithm capable of inverting almost all mangling rules; that is, given a rule r and a password to evaluate p, the rule-inversion function produces as output a regex that matches all the preimages of p under r, i.e., all the dictionary entries that, transformed by r, would produce p. At the cost of an initial pre-computation phase, following this approach, it is possible to count dictionary-word/mangling-rule hits on an attacked set without enumerating all the possible guesses. Liu et al. used the method to optimize the ordering of mangling rules in a rules-set by sorting them in decreasing hits-count order.² In doing so, the authors observed that default rules-sets only rarely follow an optimal ordering. Based on the same general approach, they speed up the automatic generation of mangling rules [3] and augment dictionaries by adding missing words in consideration of known attacked sets [7]. Similarly, they derive an approximate guess-number calculator for rule-major order attacks.

² Primarily, for rule-major order setups (e.g., JtR).
2.4 Threat Model

In our study, we primarily model the case of trawling, offline attacks. Here, an adversary aims at recovering a set of passwords X (also referred to as the attacked-set) coming from an arbitrary password distribution P(x) by performing a guessing attack. To better describe both the current trend in password storing techniques [27, 38, 39] and real-world attackers' goals [17], we assume a rational attacker who is bound to produce a limited number of guesses. More precisely, this attacker aims at maximizing the number of guessed passwords in X given a predefined budget, i.e., a maximal number of guesses the attacker is willing to perform on X. Hereafter, we model this strategy under the form of the β-success-rate [18, 19]:

    s_β(X) = ∑_{i=1}^{β} P(x_i),    (1)

where x_1, x_2, ... are the elements of the key-space sorted by decreasing probability.
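As a toy illustration of Eq. 1 (not from the paper), the β-success-rate is simply the probability mass covered by the β most probable passwords; the helper below assumes P is a mapping from candidate passwords to their probabilities:

    # Toy sketch of Eq. 1: sum the beta largest probabilities.
    def beta_success_rate(P, beta):
        return sum(sorted(P.values(), reverse=True)[:beta])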
Experimental setup  In our construction, we do not impose any limitation on the nature of P(x) nor on the attacker's a priori knowledge. However, in our experiments, we consider a weak attacker who does not retain any initial knowledge of the target distribution, i.e., who cannot provide an optimal attack configuration for X before the attack. This last assumption better describes the use-case of the automatic guessing approaches currently used in password security studies.

In the attacks reported in the paper, we always sort the words in the dictionary according to their frequency. The password leaks that we use throughout the paper are listed in Appendix A.

3 The Adaptive Mangling Rules attack

In this Section, we introduce the first core block of our password model: the Adaptive Mangling Rules. We start in Section 3.1, where we make explicit the conditional nature of mangling rules while discussing its connection with optimal attack configurations. In Section 3.2, we model the functional relationship connecting mangling rules and dictionary words via a deep neural network. Finally, leveraging the introduced tools, we establish the Adaptive Mangling Rules attack in Section 3.3.

Motivation: Dictionary attacks are highly sensitive to their configuration; while parametric approaches tend to be more robust to the choice of train-sets and hyper-parameters, the performance of dictionary attacks crucially depends on the selected dictionary and rules-set [32, 43]. As evidenced by Ur et al. [43], real-world attackers rely on extremely optimized configurations. Here, dictionaries and mangling rules are jointly created over time through practical experience [1], harnessing a domain knowledge and expertise that is mostly unknown to the academic community [32]. Very often, password security studies rely on publicly available dictionaries and rules-sets that are not as effective as the advanced configurations adopted by professionals. Unavoidably, this leads to a constant overestimation of password strength that skews the conclusions of studies and reactive analyses.

Hereafter, we show that the domain-knowledge of professional attackers can be suitably approximated with a Deep Neural Network. Given that, we devise a new dictionary attack that autonomously promotes functional interaction between the dictionary and the rules-set, implicitly simulating the precision of real-world attackers' configurations. We start by presenting the intuition behind our technique. Formalization and methodology are reported later.

3.1 The conditional nature of mangling rules

As introduced in Section 2.2, dictionary attacks describe password distributions by factorizing guesses into two main components—a dictionary word w and a transformation rule r. Here, the word w acts as a semantic base, whereas r is a syntactic transformation that aims at providing a suitable guess through the manipulation of w. Generally speaking, such a factorized representation can be thought of as an approximation of typical users' composition behavior: starting from a plain word or phrase, users manipulate it by performing operations such as leeting, appending characters, or concatenation. At configuration time, such transformations are abstracted and collected in arbitrarily large rules-sets under the form of mangling rules. Then, during the attack, guesses are reproduced by exhaustively applying the collected rules to all the words in the dictionary. In this generation process, rules are applied unconditionally to all the words, assuming that the abstracted syntactic transformations interact equally with all the elements in the dictionary. However, arguably, users do not follow the same simplistic model in their password composition process. Users first select words and then mangling transformations conditioned by those words. That is, mangling transformations are subjective and depend on the base words to which they are applied. For instance, users may prefer to append digits at the end of a name (e.g., "jimmy" to "jimmy91"), repeat short words rather than long ones (e.g., "why" to "whywhywhy"), or capitalize certain strings over others (e.g., "cookie" to "COOKIE").

Pragmatically, we can think of each mangling rule as a function that is valid on an arbitrarily small subset of the dictionary space, strictly defined by users' composition habits. Thus, applying a mangling rule to words outside this domain unavoidably brings it to produce guesses that have only a negligible probability of inducing hits during the guessing attack (i.e., that do not replicate users' behavior). This concept is captured in Figure 1, where four panels depict the hits distribution of the rules-set "best64" for four different dictionaries. Each dictionary represents a specific subset of the dictionary space that has been obtained by filtering suitable strings from the RockYou leak; namely, these are passwords composed of:
digits (Figure 1a), capital letters (Figure 1b), passwords of length 5 (Figure 1c), and passwords of length 10 (Figure 1d). The four histograms show how mangling rules selectively and heterogeneously interact with the underlying dictionaries. Rules that produce many hits for a specific dictionary inevitably perform very poorly with the others.

[Figure 1: Distribution of hits per rule for 4 different input dictionaries — (a) only digits, (b) only capital letters, (c) strings of length 5, (d) strings of length 10 — for the same attacked-set, i.e., animoto. Within a plot, each bar depicts the normalized number of hits for one of the 77 mangling rules in best64. We performed the attack with Hashcat.]

Eventually, the conditional nature of mangling rules has a critical impact in defining the effectiveness of a dictionary attack. To reach optimal performance, an attacker has to resort to a setup that a priori maximizes the conditional effectiveness of mangling rules. In this direction, we can see the highly optimized configurations used by experts as pairs of dictionaries and rules-sets that organically support each other in the guess-generation process.³ On the other hand, configurations based on arbitrarily chosen rules-sets and dictionaries may not be fully compatible and, as we show later in the paper, generate a large number of low-quality guesses. Unavoidably, this phenomenon makes adversary models based on mangling rules inaccurate and induces an overestimation of password strength [43].

Next, we show how modeling the conditional nature of mangling rules allows us to cast dictionary attacks that are inherently more resilient to poor configurations.

3.2 A Model of Rule/Word Compatibility

We introduce the notion of compatibility, which refers to the functional relation among dictionary words and mangling rules discussed in the previous Section. The compatibility can be thought of as a continuous value defined between a mangling rule r and a dictionary-word w that, intuitively, measures the utility of applying the rule r to w. More formally, we model compatibility as a function:

    π : R × W → [0, 1],

where R and W are the rule-space (i.e., the set of all the suitable transformations r : W → W) and the dictionary-space (i.e., the set of all possible dictionary words), respectively. Values of π(w, r) close to 1 indicate that the transformation induced by r is well-defined on w and would lead to a valuable guess. Values close to 0, instead, indicate that users would not apply r over w, i.e., guesses will likely fall outside the dense zone of the password distribution.

This formalization of the compatibility function also leads to a straightforward probabilistic interpretation that better supports the learning process through a neural network. Indeed, we can think of π as a probability function over the event:

    r(w) ∈ X,

where X is an abstraction of the attacked set of passwords. More precisely, we have that:

    ∀ w ∈ W, r ∈ R :  π(r, w) = P(r(w) ∈ X).

In other words, P(r(w) ∈ X) is the probability of guessing an element of X by trying the guess g = r(w) produced by the application of r over w. Furthermore, such a probability can be seen as an unnormalized version of the password distribution, creating a direct link to probabilistic password models [34, 36] (more details are given in Appendix C). However, here, the password distribution is defined over the factorized domain R × W rather than directly over the key-space.

This factorized form offers us practical advantages over the classic formulation. More in detail, by choosing and fixing a specific rule-space R (i.e., a rules-set), we can reshape the compatibility function as:

    π_R : W → [0, 1]^|R|.    (2)

This version of the compatibility function takes as input a dictionary-word and outputs a compatibility value for each rule in the chosen rules-set with a single inference. This form is concretely more computationally convenient and will be used to model the neural approximation of the compatibility function.
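For concreteness, a minimal sketch of a network realizing Eq. 2 is given below. The architecture is deliberately simplified (the paper's actual residual convolutional model is described in its Appendix D), and all hyper-parameters here are illustrative assumptions:

    import tensorflow as tf

    # Simplified sketch of Eq. 2: one inference maps a character-encoded
    # dictionary-word to |R| compatibility scores in [0, 1].
    def make_pi_R(num_rules, max_len=16, vocab_size=256):
        inp = tf.keras.Input(shape=(max_len,), dtype="int32")
        x = tf.keras.layers.Embedding(vocab_size, 64)(inp)   # char-level embedding
        x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        out = tf.keras.layers.Dense(num_rules, activation="sigmoid")(x)
        return tf.keras.Model(inp, out)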
Next, we show how the compatibility function can be inferred from raw data using a deep neural network.

³ This has also been indirectly observed by Ur et al. in their ablation study on pros' guessing strategy, where the greatest improvement was achieved with a proprietary dictionary in tandem with a proprietary rules-set.
3.2.1 Learning the compatibility function

As stated before, the probabilistic interpretation of the compatibility function makes it possible to learn π using a neural network. Indeed, the probability P(r(w) ∈ X), in any form, can be described through a binary classification: for each pair word/rule (w, r), we have to predict one of two possible outcomes: g ∈ X or g ∉ X, where g = r(w). In solving this classification task, we can train a neural network in a logistic regression and obtain a good approximation of the probability P(r(w) ∈ X).

In the same way, the reshaped formulation of π (i.e., Eq. 2) describes a multi-label classification. In a multi-label classification, each input participates simultaneously in multiple binary classifications, i.e., an input is associated with multiple classes at the same time. More formally, having a fixed number of possible classes n, each data point is mapped to a binary vector in {0, 1}ⁿ. In our case, n = |R|, and each bit in the binary vector corresponds to the outcome of the event r_j(w) ∈ X for a rule r_j ∈ R.

To train a model, then, we have to resort to a supervised learning approach. We have to create a suitable training-set composed of pairs (input, label) that the neural network can model during the training. Under our construction, we can easily produce such suitable labels by performing a mangling rules attack. In particular, fixed a rules-set R, we collect pairs (w_i, y_i), where w_i is the input to our model (i.e., a dictionary-word) and y_i is the label vector associated with w_i. As explicated before, the label y_i asserts the membership of the list of guesses [r_1(w_i), r_2(w_i), ..., r_|R|(w_i)] over a hypothetical target set of passwords X, i.e.:

    y_i = [r_1(w_i) ∈ X, r_2(w_i) ∈ X, ..., r_|R|(w_i) ∈ X].    (3)

To collect labels, then, we have to concretize X by choosing a representative set of passwords. Intuitively, such a set should be sufficiently large and diverse since it describes the entire key-space. Hereafter, we refer to this set as X_A. This is the set of passwords we attack during the process of collecting labels.

In the same way, we have to choose another set of strings W that represents and generalizes the dictionary-space. This is used as input to the neural network during the training process, and as the input dictionary during the simulated guessing attack. Details on the adopted set are given at the end of the section.

Finally, given X_A and W, and chosen a rules-space R, we construct the set of labels by simulating a guessing attack; that is, for each entry w_i in the dictionary W, we collect the label vector y_i (Eq. 3). In doing so, we used a modified version of Hashcat described in Appendix H. Alternatively, the technique proposed in [32] can be used to speed up the labels collection.
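A minimal sketch of this label-collection step is shown below (helper names are assumed; the actual implementation uses the modified Hashcat mentioned above). The two filtering details in its comments are explained in the next paragraph:

    # Sketch of building training pairs (w_i, y_i) as in Eq. 3 by simulating
    # a mangling-rules attack of W against X_A.
    def build_labels(W, R, X_A):
        X_A = set(X_A)          # hits are NOT removed from X_A (see below)
        dataset = []
        for w in W:
            # one binary label per rule; identity outputs r(w) = w are not hits
            y = [1 if (g := r(w)) != w and g in X_A else 0 for r in R]
            dataset.append((w, y))
        return dataset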
Unlike the actual guessing attack, in the process, we do not remove passwords from X_A when those are guessed correctly; that is, the same password can be guessed multiple times by different combinations of rules and words. This is necessary to correctly model the functional compatibility. In the same way, we do not consider the identity mangling rule (i.e., ':') in the construction of the training set. When it occurs, we remove it from the rules-set. To the same end, we do not consider hits caused by conditional identity transformations, i.e., r(w) = w.

Training set configuration  The creation of a training set entails the proper selection of the sets X_A and W as well as the rules-set R. Arguably, the most critical choice is the set X_A, as this is the ground-truth on which we base the approximation of the compatibility function. In our study, we select X_A to be the password leak discovered by 4iQ in the Dark Web [12]. We completely anonymized all entries by removing users' information, and obtained a set of ∼4·10⁸ unique passwords. We use this set as X_A within our models.

Similarly, we want W to be a good description of the dictionary-space. However, in this case, we exploit the generalization capability of the neural network, which can automatically infer a general description of the input space from a relatively small training set. In our experiments, we use the LinkedIn leak as W.

Finally, we train three neural networks that learn the compatibility function for three different rules-sets, namely PasswordPro, generated, and generated2. Those sets are provided with the Hashcat software and have been widely studied in previous works [32, 34, 37]. Table 1 lists them along with some additional information.

    Name          Cardinality   Brief Description
    PasswordPro   3120          Manually produced.
    generated     14728         Automatically generated.
    generated2    65117         Automatically generated.

    Table 1: Hashcat mangling-rules sets used.

Eventually, the labels we collect in the guessing process are extremely sparse. In our experiments, more than 95% of the guesses are a miss, causing our training-set to be extremely unbalanced towards the negative class.

Model definition and training  We construct our model over a residual structure [25] primarily composed of mono-dimensional convolution layers. Here, input strings are first embedded at character-level via a linear transformation; then, a series of residual blocks is sequentially applied to extract a global representation for dictionary words. Finally, such representations are mapped into the label-space by means of a single, linear layer that performs the classification task. This architecture is trained in a multi-label classification; each output of the final dense layer is squashed into the interval [0, 1] via the logistic (sigmoid) function, and binary cross-entropy is applied
to each probability separately. The network's loss is then obtained by summing up all the cross-entropies of the |R| classes.

As mentioned in the previous Section, our training-set is extremely unbalanced toward the negative class; that is, the vast majority of the ground-truth labels assigned to a training instance are negative. Additionally, a similar disproportion appears in the distribution per rule. Typically, we have many rules that count only a few positive examples, whereas others have orders of magnitude more hits. In our framework, we alleviate the negative effects of those disproportions by inductive bias. In particular, we achieve it by considering a focal regulation in our loss function [31].

Originally developed for object detection tasks, in which there is a strong imbalance between foreground and background classes, we adopt the focal regulation to account for sparse and underrepresented labels when learning the compatibility function. This focal loss is mainly characterized by a modulating factor γ that dynamically reduces the importance of well-classified instances in the computation of the loss function, allowing the model to focus on hard examples (e.g., underrepresented rules). More formally, the form of regularized binary cross-entropy that we adopt is defined as:

    FL(p_j, y_j) = −(1 − α)(1 − p_j)^γ log(p_j)    if y_j = 1
    FL(p_j, y_j) = −α p_j^γ log(1 − p_j)           if y_j = 0,

where p_j is the probability assigned by the model to the j-th class, and y_j is the ground-truth label (i.e., 1/hit and 0/miss). The parameter α in the equation allows us to assign an a priori importance factor to the negative class. We use it to down-weight the correct predictions of the negative class in the loss function, which would be dominant otherwise. In our setup, we dynamically select α based on the distribution of the hits observed in the training set. In particular, we choose α = p̄/(1 − p̄), where p̄ is the ratio of positive labels (i.e., hits/guesses) in the dataset. Differently, we fix γ = 2, as we found this value to be optimal in our experiments.

Summing up, our loss function is defined as:

    L_f = E_{x,y} [ ∑_{j=1}^{|R|} FL(sigmoid(f(x)_j), y_j) ],

where f are the logits of the neural network. We train the model using Adam stochastic gradient descent [29] until an early-stopping criterion based on the AUC of a validation set is reached.
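A minimal TensorFlow sketch of this class-balanced focal loss follows (a direct transcription of the two cases above; the numerical-stability epsilon is our addition):

    import tensorflow as tf

    # Sketch of the focal loss above: labels in {0,1}^(batch, |R|),
    # logits of shape (batch, |R|); alpha = p_bar / (1 - p_bar), gamma = 2.
    def focal_loss(labels, logits, alpha, gamma=2.0, eps=1e-7):
        p = tf.sigmoid(logits)
        pos = -(1.0 - alpha) * tf.pow(1.0 - p, gamma) * tf.math.log(p + eps)
        neg = -alpha * tf.pow(p, gamma) * tf.math.log(1.0 - p + eps)
        fl = tf.where(labels > 0.5, pos, neg)
        # sum the |R| per-class terms, then average over the batch (L_f)
        return tf.reduce_mean(tf.reduce_sum(fl, axis=-1))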
Maintaining the same general architecture, we train different networks with different sizes. In our experiments, we noticed that large networks provide a better approximation of the compatibility function, although small networks can be used to reduce the computational cost with a limited loss in utility. In the paper, we report the results only for our biggest networks.

We implemented our framework on TensorFlow; the models have been trained on an NVIDIA DGX-2 machine. A complete description of the architectures employed is given in Appendix D. Additionally, Appendix I contains further remarks on the neural approximation of the compatibility function.

Ultimately, we obtain three different neural networks: one for each rules-set reported in Table 1. Summing up, each neural network is an approximation of the compatibility function π_R for the respective rules-set R that is capable of assigning a compatibility score to each rule in R with a single network inference, i.e., Eq. 2. The suitability of these neural approximations will be proven later in the paper.

Additional approaches  To improve the performance of our method, we further investigated domain-specific constructions for multi-label classification. In particular, we tested label-embedding techniques. Those are approaches that aim at implicitly modeling the correlation among labels. However, although unconditional dependence is evident in the modeled domain, we found no concrete advantage in directly considering it during the training. In the same direction, we investigated more sophisticated embedding techniques, where labels and dictionary-words were jointly mapped to the same latent space [48], yet achieving similar performance.

Additionally, we tested implementations based on transformer networks [44], obtaining no substantial improvement. We attribute such a result to the lack of dominant long-term relationships among the characters composing dictionary-words. In such a domain, we believe convolutional filters to be fully capable of capturing characters' interactions. Furthermore, convolutional layers are significantly more efficient than the multi-head attention mechanism used by transformer networks.

3.3 Adaptive Mangling Rules

As motivated in Section 3.2, each word in the dictionary interacts just with a limited number of mangling transformations that are conditionally defined by users' composition habits. While modern rules-sets can contain more than ten thousand entries, each dictionary-word w will interact only with a small subset of compatible rules, say R_w. As stated before, optimized configurations compose over pairs of dictionaries and rules-sets that have been created to mutually support each other. This is achieved by implicitly maximizing the average cardinality of the compatible set of rules R_w for each dictionary-word w in the dictionary.

In doing so, advanced attackers rely on domain knowledge and intuition to create optimized configurations. But, thanks to the explicit form of the compatibility function, it is possible to simulate their expertise. The intuition is that, given a dictionary-word w, we can infer the compatible rules-set R_w (i.e., the set of rules that interact well with w) according to the
compatibility scores assigned by the neural approximation of π. More formally, given π for the rules-set R and a dictionary-word w, we can determine the compatible rules-set for w by thresholding the compatibility values assigned by the neural network to the rules in R:

    R_w ≈ R_w^β = { r | r ∈ R ∧ π(w, r) > (1 − β) },    (4)

where β ∈ (0, 1] is a threshold parameter whose effect will be discussed later.

At this point, we simulate high-quality-configuration attacks by ensuring that dictionary-words do not interact with rules outside their compatible rules-set R_w^β. Algorithm 1 implements this strategy by following a word-major order in the generation of guesses. Every dictionary-word is limited to interact with the subset of compatible rules R_w^β that is decided by the neural net. Intuitively, this is equivalent to assigning and applying a dedicated (and possibly unique) rules-set to each word in the dictionary. Note that the selection of the compatible rules-set is performed at runtime, during the attack, and does not require any pre-computation. We call this novel guessing strategy Adaptive Mangling Rules, since the rules-set is continuously adapted during the attack to better assist the selected dictionary.

    Algorithm 1: Adaptive mangling rules attack.
    Data: dictionary D, rules-set R, budget β, neural net π_R
    1  forall w ∈ D do
    2      R_w^β = { r | π_R(w)_r > (1 − β) };
    3      forall r ∈ R_w^β do
    4          g = r(w);
    5          issue g;
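Expressed in plain Python, Algorithm 1 amounts to the sketch below (names assumed; not the actual implementation, which is described in Appendix H):

    import numpy as np

    # Sketch of Algorithm 1: `pi_R(w)` is one network inference returning a
    # score in [0, 1] for each of the |R| rules (Eq. 2); R is a list of rules.
    def adaptive_attack(D, R, beta, pi_R):
        for w in D:
            scores = pi_R(w)                        # shape (|R|,)
            for j in np.flatnonzero(scores > 1.0 - beta):
                yield R[j](w)                       # issue the guess g = r(w)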
The efficacy of adaptive mangling rules over the standard attack is shown in Figure 2, where multiple examples are reported. The adaptive mangling rules reduce the number of produced guesses while keeping the hits count mostly unchanged. In our experiments, the adaptive approach induces compatible rules-sets that, on average, are an order of magnitude smaller than the complete rules-set. Typically, for β=0.5, only ∼10–15% of the rules are conditionally applied to the dictionary-words. Considering the percentage of guessed passwords for adaptive and non-adaptive attacks, this means that approximately 90% of guesses are wasted during classic, unoptimized mangling rules attacks. Figure 3 further reports the distribution of selected rules during the adaptive attack of Figure 2a. It emphasizes how mangling rules heterogeneously interact with the underlying dictionary. Although very few rules interact well with all the words (e.g., selection frequency > 70%), most of the mangling rules participate only in rare events. Further empirical validation of the adaptive mangling rules will be given later in Section 5.

[Figure 2: Comparison between adaptive and classic mangling rules on four combinations of password leaks (dictionary/attacked-set) — (a) MyHeritage on animoto, (b) animoto on MyHeritage, (c) animoto on RockYou, (d) RockYou on animoto — using the rules-set PasswordPro. β=0.5 is used for the adaptive case.]

[Figure 3: Selection frequencies of adaptive mangling rules for the 3120 rules of PasswordPro.]

The Attack Budget  Unlike standard dictionary attacks, whose effectiveness solely depends on the initial configuration, adaptive mangling rules can be controlled by an additional scalar parameter that we refer to as the attack budget β. This parameter defines the threshold of compatibility that a rule must exceed to be included in the rules-set R_w^β for a word w. Indirectly, this value determines the average size of compatible rules-sets and, consequently, the total number of guesses performed during the attack. More precisely, low values of β force compatible rules-sets to include only rules with high compatibility. Those will produce only a limited number of guesses, inducing very precise attacks at the cost of missing possible hits (i.e., high precision, low recall). Higher values
of β translate into a more permissive selection, where rules with low compatibility are also included in the compatible set. Those will increase the number of produced guesses, inducing more exhaustive, yet more imprecise, attacks (i.e., higher recall, lower precision). When β reaches 1, the adaptive mangling rules attack becomes a standard mangling rules attack, since all the rules are unconditionally included in the compatible rules-set. The effect of the budget parameter is better captured by the examples reported in Figure 4. Here, the performance of multiple values of β is visualized and compared with the total hits and guesses performed by a standard mangling rules attack.

[Figure 4: Effect of the parameter β (adaptive β = 0.4, 0.5, 0.6, 0.7) on guessing performance for four different combinations of password sets — (a) MyHeritage on animoto, (b) animoto on MyHeritage, (c) animoto on RockYou, (d) RockYou on animoto — and PasswordPro rules. Plots are normalized according to the results of the standard mangling rules attack (i.e., β = 1). For instance, (x=0.1, y=0.95) means that we guessed 95% of the passwords guessed with the standard mangling rules attack by performing 10% of the guesses required by the latter.]

The budget parameter β can be used to model different types of adversaries. For instance, rational attackers [17] change their configuration in consideration of the practical cost of performing the attack. This parameter permits us to easily describe those attackers and evaluate password security accordingly. For instance, using a low budget (e.g., β=0.4), we can model a greedy attacker who selects an attack configuration that maximizes guessing precision at the expense of the number of compromised accounts (a rational behavior in case of an expensive hash function).

Seeking a more pragmatic interpretation, the budget parameter is implicitly equivalent to early stopping⁴ (i.e., Eq. 1), where single guesses are sorted in optimal order, i.e., guesses are exhaustively generated before the attack and indirectly sorted by decreasing probability/compatibility.

⁴ The attack stops before the guesses are terminated.

The optimal value of β depends on the rules-set. In our tests, we found these optimal values to be 0.6, 0.8 and 0.8 for PasswordPro, generated and generated2, respectively. Hereafter, we use these setups, unless otherwise specified.

Computational cost  One of the core advantages of dictionary attacks over more sophisticated approaches [34, 36, 45] is their speed. For mangling rules attacks, generating guesses has an almost negligible impact. Despite being consistently more complex in their mechanisms, adaptive mangling rules do not change this feature.

In Algorithm 1, the only additional operation over the standard mangling rules attack is the selection of compatible rules for each dictionary-word via the trained neural net. As discussed in Section 3.2.1, this operation requires just a single network inference; that is, with a single inference, we obtain a compatibility score for each element in {w} × R. Furthermore, inference for multiple consecutive words can be trivially batched and computed in parallel, further reducing the computation's impact.
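For instance, such batched scoring can be sketched as follows (helper names assumed; `model` is the trained π_R network and `encode` the character encoder):

    # Sketch: amortizing the network cost by scoring many consecutive
    # dictionary-words per inference; yields (word, scores) pairs.
    def batched_scores(model, words, encode, batch_size=16384):
        for i in range(0, len(words), batch_size):
            batch = words[i : i + batch_size]
            scores = model(encode(batch)).numpy()   # shape (len(batch), |R|)
            yield from zip(batch, scores)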
Table 2 reports the number of compatibility values that different neural networks can compute per second. In the table, we used our largest networks without any form of optimization. Nevertheless, the overhead over the plain mangling rules attack is minimal (see Appendix G). Additionally, similar to standard dictionary attacks, adaptive mangling rules attacks are inherently parallel and, therefore, distributable and scalable.

    generated2 (large)   generated (large)   PasswordPro (large)
    130,550,403 c/s      89,049,382 c/s      31,836,734 c/s

    Table 2: Number of compatibility scores computed per second (c/s) for different networks. Values computed on a single NVIDIA V100 GPU.

4 Dynamic Dictionary attacks

This section introduces the second and last component of our password model—a dynamic mechanism that systematically adapts the guessing configuration to the unknown attacked-set. In Section 4.1, we introduce the Dynamic Dictionary Augmentation technique. Next, in Section 4.2, we introduce the concept of Dynamic Budgets.

Motivation: As widely documented [18, 21, 33, 37], password composition habits slightly change from sub-population
to sub-population. Although passwords tend to follow the same general distribution, credentials created under different environments exhibit unique biases. Users within the same group usually choose passwords related to each other, influenced mostly by environmental factors or the underlying applicative layer. Major factors, for example, are users' mother tongue [21], community interests [47], and imposed password composition policies [30]. These have a significant impact on defining the final password distribution and, consequently, the guessability of the passwords [28]. The same factors that shape a password distribution are generally available to attackers, who can collect and use them to drastically improve the configuration of their guessing attacks.

Unfortunately, current automatic reactive/proactive guessing techniques fail to describe this natural adversarial behavior [28, 32, 33, 43, 46]. Those methods are based on static configurations that apply the same guessing strategy to each attacked-set of passwords, mostly ignoring trivial information that can be either collected a priori or distilled from the running attack. In this Section, we discuss suitable modifications of the mangling-rules framework to describe a more realistic guessing strategy. In particular, avoiding the necessity of any prior knowledge of the attacked-set, we rely on the concept of a dynamic attack [37]. Here, a dynamic attacker is an adversary who changes his guessing strategy according to the attack's success rate. Successful guesses are used to select future attempts with the goal of exploiting the non-i.i.d. nature of passwords originating from the same environment. In other words, dynamic password guessing attacks automatically collect information on the target password distribution and use it to forge unique guessing configurations for the same set during the attack. Similarly, this general guessing approach can be easily linked to the optimal guessing strategy harnessed by human experts in [43], where mangling rules were created at execution time based on the initially guessed passwords.

4.1 Dynamic Dictionary Augmentation

In [37], dynamic adaptation of the guessing strategy is obtained from password latent-space manipulations of deep generative models. A similar effect is reproduced within our mangling-rules approach by relying on a consistently simpler, yet powerful, solution based on hit-recycling. That is, every time we guess a new password by applying a mangling rule to a dictionary word, we insert the guessed password into the dictionary at runtime. In practice, we dynamically augment the dictionary during the attack using the guessed passwords.⁵ In the process, every new hit is directly reconsidered and semantically extended through mangling rules. This recursive method brings about massive chains/trees of hits that can extend for thousands of levels.⁶

⁵ Although we have not found any direct reference to the hits-recycling technique in the literature, it is likely well known and routinely deployed by professionals.
⁶ I.e., a forest, where the root of each tree is a word from the original dictionary.
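A minimal sketch of hit-recycling follows (names assumed; real attacks operate on password hashes rather than a plaintext set X):

    from collections import deque

    # Sketch of dynamic dictionary augmentation: every hit re-enters the
    # dictionary queue and is mangled in turn, growing the hits-tree.
    def dynamic_attack(D, R, X):
        X = set(X)                       # attacked-set (plaintext abstraction)
        queue, seen = deque(D), set(D)
        while queue:
            w = queue.popleft()
            for r in R:
                g = r(w)
                if g in X and g not in seen:
                    seen.add(g)
                    queue.append(g)      # recycle the hit as a new dictionary word
                    yield g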
[Figure 5: Example of a small hits-tree induced by the dynamic attack performed on the phpBB leak (e.g., "steph" → "phpphp" → "php123" → "php123456"). In the tree, every vertex is a guessed password; an edge between two nodes indicates that the child password has been guessed by applying a mangling rule to the parent password.]

Figure 5 depicts an extremely small subtree ("hits-tree") obtained by attacking the password leak phpBB. The tree starts when the word "steph" is mangled, incidentally producing the word "phpphp". Since the latter lies in a dense zone of the attacked set (i.e., it is a common users' practice to insert the name of the website or related strings in their password), it induces multiple hits and causes the attack to focus on that specific zone of the key-space. The focus of the attack grows exponentially hit after hit and automatically stops only when no more passwords are matched. Eventually, this process makes it possible to guess passwords that would be missed with the static approach. For instance, in Figure 5, all the nodes in bold are passwords matched by the dynamic attack but missed by the static one (i.e., the standard dictionary attack) under the same configuration.

Figure 6 compares the guessing performance of the dynamic attack against the static version on a few examples for the PasswordPro rules-set. The plots show that the dynamic augmentation of the dictionary has a very heterogeneous effect on the guessing attacks. In the case of Figure 6a, the dynamic attack produces a substantial increment in the number of guesses as well as in the number of hits, i.e., from ∼15% to ∼80% recovered passwords. Arguably, such a gap is due to the minimal size of the original dictionary phpBB. In the attack of Figure 6b, instead, a similar improvement is achieved while requiring only a small number of guesses. On the other hand, in the attack depicted in Figure 6c, the dynamic augmentation has a limited effect on the final hits number.
[Figure 6: Performance comparison between dynamic and classic (static) attack for five different setups of dictionary/attacked-set — (a) phpBB on animoto, (b) RockYou on animoto, (c) MyHeritage on animoto, (d) animoto on RockYou, (e) animoto on MyHeritage. The rules-set PasswordPro in non-adaptive mode is used in all the reported attacks. The 5 setups have been handpicked to fully represent the possible effects of the dynamic dictionary augmentation.]

However, it increases the attack precision in the initial phase. Conversely, the attacks in Figures 6d and 6e show a decreased precision in the initial phase of the attack, but that is compensated later by the dynamic approach. The same results are reported in Appendix F for the rules-sets generated and generated2.

Another interesting property of the dynamic augmentation is that it makes the guessing attack consistently less sensitive to the choice of the input dictionary. Indeed, in contrast with the static approach, different choices of the initial dictionary tend to produce very homogeneous results in the dynamic approach. This behavior is captured in Figure 7, where the results obtained by varying three input dictionaries (phpBB, RockYou, MyHeritage) are compared between static and dynamic attacks. The standard attacks (Figure 7a) result in very different outcomes; for instance, using phpBB we match 15% of the attacked-set, whereas we match more than 80% with MyHeritage. These differences in performance are leveled out by the dynamic augmentation of the dictionary (Figure 7b); all the dynamic attacks recover ∼80% of the attacked-set. Intuitively, dynamic augmentation remedies deficiencies in the initial configuration of the dictionary, promoting its completeness. These claims will find further support in Section 5.

[Figure 7: Guessing attacks performed on the animoto leak using three different dictionaries (phpBB, RockYou, MyHeritage). The panel on the left (a: standard attack) reports the guessing curves for the static setup. The panel on the right (b: dynamic attack) reports those for the dynamic setup. The x-axis is logarithmic.]

4.2 Dynamic budgets

Adaptive mangling rules (Section 3.3) demonstrated that it is possible to consistently improve the precision of the guessing attack by promoting compatibility between the rules-set and the dictionary (i.e., simulating high-quality configurations at runtime). This approach assumes that the compatibility function modeled before the attack is sufficiently general to simulate good configurations for each possible attacked-set. However, as motivated in the introduction of Section 4, every attacked set of passwords presents peculiar biases and, therefore, different compatibility relations among rules and dictionary-words. To reduce the effect of this dependence, we introduce an additional dynamic approach supporting the adaptive mangling rules framework. Rather than modifying the neural network at runtime (which is neither a practical nor a reliable solution), we alter the selection process of compatible rules by acting on the budget parameter β.

    Algorithm 2: Adaptive mangling rules with Dynamic budget.
    Data: dictionary D, rules-set R, attacked-set X, budget β
    1  forall w ∈ D do
    2      R_w^B = { r | π_R(w)_r > (1 − B_r) };
    3      forall r ∈ R_w^B do
    4          g = r(w);
    5          if g ∈ X then
    6              X = X − {g};
    7              B_r = B_r + ∆;
    8      B = B · (|R|·β / ∑_r B_r);

Algorithm 2 details our solution. Here, rather than having a global parameter β for all the rules of the rules-set R, we have a budget vector B that assigns a dedicated budget value to each rule in R (i.e., B ∈ (0, 1]^|R|). Initially, all the budget values in B are initialized to the same value β (i.e., ∀ r ∈ R: B_r = β) given as an input parameter. During the attack, the elements of B are individually increased and decreased to better describe the attacked set of passwords. Within this context, increasing the budget B_r of a rule r means reducing the compatibility
threshold needed to include r in the compatible rules-set of a dictionary-word w and, consequently, making r more popular during the attack. On the other hand, by decreasing B_r, we reduce the chances of selection for r; r is selected only in case of high-compatibility words.

In the algorithm, we increase the budget B_r when the rule r produces a hit. The added increment is a small value ∆ that scales inversely with the number of guesses produced.

At the end of the internal loop, the vector B is then normalized; i.e., we scale the values in B so that ∑_r B_r = |R|·β. Normalizing B has two aims. (1) It reduces the budgets of non-hitting rules (the mass we add to the budget of rule r is subtracted from all the other budgets). (2) It maintains the total budget of the attack (i.e., |R|·β) unchanged, so that dynamic and static budgets lead to almost the same number of guesses during the attack for a given β. Furthermore, we impose a maximum and a minimum bound on the increments or decrements of B. This is to prevent values of zero (rule always excluded) or equal to/higher than one (rule always included).
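A compact sketch of this update step follows (names assumed; the bounds lo/hi are illustrative placeholders for the clipping described above):

    import numpy as np

    # Sketch of the dynamic-budget update in Algorithm 2: reward hitting
    # rules, then rescale so that sum(B) stays equal to |R| * beta.
    def update_budgets(B, hit_rule_idx, beta, delta, lo=0.05, hi=0.95):
        B[hit_rule_idx] += delta
        B *= (len(B) * beta) / B.sum()
        np.clip(B, lo, hi, out=B)        # never fully exclude/include a rule
        return B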
As for the dynamic dictionary augmentation, the dynamic budget always has a positive, yet heterogeneous, effect on the guessing performance. Mostly, the number of hits increases or remains unaffected. Among the proposed techniques, this is the one with the mildest effect. Yet, it will be particularly useful when combined with the dynamic dictionary augmentation in the next Section. Appendix E better explicates the improvement induced by the dynamic budgets.

5 Adaptive, Dynamic Mangling rules: AdaMs

The results of the previous section confirm the effectiveness of the dynamic guessing mechanisms. We increased the number of hits compared to classic dictionary attacks by using the produced guesses to improve the attack on the fly. However, in the process, we also increased the number of guesses, possibly in a way that is hard to control and gauge. Moreover, by changing the dictionary at runtime, we disrupt any form of optimization of the initial configuration, such as any a priori ordering in the wordlist [32] and any joint optimization with the rules-set.⁷ Unavoidably, this leads to sub-optimal attacks that may overestimate password strength.

To mitigate this phenomenon, we combine the dynamic augmentation technique with the Adaptive Mangling Rules framework. The latter seeks an optimal configuration at runtime on the dynamic dictionary, promoting compatibility with the rules-set and limiting the impact of imperfect dictionary-words. This process is further supported by the dynamic budgets, which address possible covariate-shifts [42] of the compatibility function induced by the augmented dictionary.

Hereafter, we refer to this final guessing strategy as AdaMs (Adaptive, Dynamic Mangling rules). Details on the implementation of AdaMs are given in Appendix H, whereas we benchmark it in Appendix G.

⁷ I.e., new words may not interact well with the mangling rules in use.

5.1 Evaluation

Figure 8 reports an extensive comparison of AdaMs against standard mangling-rules attacks. In the figure, we test all the pairs of dictionary/rules-set obtained from the combination of the dictionaries MyHeritage, RockYou, animoto, and phpBB and the rules-sets PasswordPro and generated on four attacked-sets. Results for generated2 are reported in Appendix F instead. Hereafter, we switch to a logarithmic scale given the heterogeneity of the number of guesses produced by the various configurations.

For the reasons given in the previous sections, AdaMs outperforms standard mangling rules within the same configurations, while requiring fewer guesses on average. More interestingly, AdaMs attacks generally exceed the hits count of all the standard attacks regardless of the selected dictionary. In particular, this is always true for the generated rules-set. Conversely, in cases where the dynamic dictionary augmentation offers only a small gain in the number of hits (e.g., attacking RockYou), AdaMs equalizes the performance of the various dictionaries, typically towards the optimal configuration for the standard attack. In Figures 8d and 8h, all the configurations of AdaMs reach a number of hits comparable to the best configuration for the standard attack, i.e., using MyHeritage, while requiring up to an order of magnitude fewer guesses (e.g., Figure 8d), further confirming that the best standard attack is far from being optimal. In the reported experiments, the only outlier is phpBB when used against zooks in Figure 8b. Here, AdaMs did not reach/exceed all the standard attacks in the number of hits despite consistently redressing the initial configuration. However, this discrepancy is canceled out when more mangling rules are considered, i.e., in Figure 8f.

Eventually, the AdaMs attack makes the initial selection of the dictionary systematically less influential. For instance, in our experiments, a set such as phpBB reaches the same performance as wordlists that are two orders of magnitude larger (e.g., RockYou). The crucial factor remains the rules-set's cardinality, which ultimately determines the magnitude of the attack, even though it does not appreciably affect the guessing performance.

The effectiveness of AdaMs is better captured by the results reported in Figure 9. Here, we create a synthetic optimal dictionary for an attacked-set and evaluate the capability of AdaMs to converge to the performance of such an optimal configuration. To this end, given a password leak X, we randomly divide it into two disjoint sets of equal size, say X_dict and X_target. Then, we attack X_target by using both X_dict (i.e., the optimal dictionary) and an external dictionary (i.e., a sub-optimal dictionary). Arguably, X_dict is the a priori optimal dictionary to attack X_target since X_dict and X_target are samples of the very