
Developer Notes


March 19th, 2019

So it's been two years since updating this page, though there has been a lot of work in the background. Version 3.3 Beta is still the official release, and contains many of the features below, such as incorporating JtR Markov into the base grammar. Currently I'm working on version 4.0, which is a whole new release of the PCFG code-base. I'm experiencing a major case of feature creep and at some point will need to label it "done" and release it, but for now you can track my progress in the "Projects" tab of GitHub for this repo. Major improvements to the PCFG grammar include multiword support, using OMEN vs JtR Markov, and L33t letter replacements. We'll see how much progress I make on the manager/cracker, but ideally I want to include multiple "gears" to speed up guess generation as cracking sessions grow longer, at the expense of making less precise guesses. That way it'll start out really slow but make very precise guesses, and as time goes on sacrifice some of that precision to make guesses much faster. The really exciting thing about release 4.0, though, will be what it sets PCFGs up for in the future. I'm currently looking at getting native PCFG support into both Hashcat and JtR, which will mean a full C port of the manager code. I'm also looking at making the grammar much more tailorable after it has been created/trained, to aid in support of targeted cracking sessions.

May 30th, 2017

Well, it's been a year since my last developer's note. Lots of credit goes to @m33x for using this tool and basically kicking me in the butt to continue to develop it. Currently I'm in the process of cleaning things up for a 3.3 Beta release. The main new feature will be integration of John the Ripper's Markov modeling into the grammar, so that password structures not seen in the training list will still be tried. I'm really happy with how this release is turning out, and it's continuing to surprise me. I'm hopeful I can look into more advanced ways to incorporate Markov guess generation in 3.4. For example, I'd like to include it as a generic terminal in the base structures, so you could end up with guesses like 123<markov>123. Also it would be fun to try out OMEN instead of JtR's --Markov. Another wishlist item that I'm saving for a rainy day is to start coding up a post-training/pre-cracking tool that will customize a grammar to your particular cracking session. For example, update dates found in passwords, update word probabilities (e.g. swap out 'rockyou' with a site or organization name), etc.
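
To make that "Markov as a generic terminal" idea a bit more concrete, here is a minimal sketch of what expanding a base structure with an embedded Markov terminal could look like. The structure layout and the markov_guesses() helper are hypothetical stand-ins, not the tool's actual API:

```python
# Hypothetical sketch only: a base structure where the middle terminal is
# fed by a Markov generator instead of a dictionary of trained strings.
# markov_guesses() and the structure layout are illustrative stand-ins.

def markov_guesses(threshold):
    """Stand-in for an external Markov guesser (e.g. JtR's Markov mode)."""
    # A real integration would stream guesses below the given threshold.
    return ["monkey", "dragon", "qwerty"]

# Roughly "D3 M D3": digits, a Markov-generated core, then digits again,
# which is how you could end up with guesses shaped like 123<markov>123.
base_structure = [
    ("D", ["123", "2017"]),          # digit terminals from the training set
    ("M", markov_guesses(250)),      # Markov terminal, threshold=250
    ("D", ["123", "007"]),
]

def expand(structure, prefix=""):
    """Recursively expand a base structure into concrete guesses."""
    if not structure:
        yield prefix
        return
    _, replacements = structure[0]
    for value in replacements:
        yield from expand(structure[1:], prefix + value)

for guess in expand(base_structure):
    print(guess)
```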

Another feature that will make it into the 3.3 release is probability smoothing, which, as the name implies, smooths out the probabilities of items found in the training set. This helps with the speed at which guesses are generated. If you have been reading the previous developer's notes, I went with smoothing items based on how close in probability they are to each other, vs. implementing the Lloyd-Max algorithm.
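
As a rough illustration of that "group items that are close in probability" approach (my own minimal sketch of the idea, not the trainer's actual code), terminals whose counts sit within a relative closeness threshold of each other get merged into one group and share a common probability:

```python
# A minimal sketch of grouping terminals that are close in probability.
# Not the trainer's real code; the closeness metric here is illustrative.

def smooth(counts, closeness=0.10):
    """counts: dict mapping terminal -> occurrence count in the training set."""
    total = sum(counts.values())
    # Walk terminals from most to least common
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    groups = []
    current = [items[0]]
    for term, count in items[1:]:
        top_count = current[0][1]
        # Merge into the current group if this terminal's count is within
        # `closeness` (relative) of the group's most common member
        if (top_count - count) / top_count <= closeness:
            current.append((term, count))
        else:
            groups.append(current)
            current = [(term, count)]
    groups.append(current)

    # Everyone in a group gets the same averaged probability, so a single
    # mangling rule can emit all of them
    smoothed = {}
    for group in groups:
        prob = sum(c for _, c in group) / (total * len(group))
        for term, _ in group:
            smoothed[term] = prob
    return smoothed

# '5732' and '6207' collapse into one group; '1234' stays on its own
print(smooth({"1234": 50000, "5732": 2, "6207": 1}, closeness=0.5))
```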

June 14th, 2016

Currently cleaning up the code to get a Beta release ready to push into the Main branch. The core guess generator is running, though I still haven't added the ability to regenerate the priority queue yet. I have split off the whole "backup storage" process into its own thread + class, so more inventive options can be applied later while limiting the impact on the main performance. Also the "guess generation" functionality has been split into a different process from the "next pre-terminal" functionality so they can run in parallel. Long story short, it's still missing some features and is slow, but I think it's at a point where it may be useful for other people.
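
For anyone curious what that process split looks like in spirit, here is a bare-bones producer/consumer sketch using Python's multiprocessing module. The function names and the pre-terminal format are placeholders, not the project's actual classes:

```python
# Bare-bones sketch of running "next pre-terminal" and "guess generation"
# in separate processes connected by a queue. Placeholder names throughout.
import multiprocessing as mp

def preterminal_producer(queue):
    """Walks the grammar in probability order and hands off pre-terminals."""
    # Stand-in pre-terminals: (base structure, probability) pairs
    for preterminal in [("D3 L6 D3", 0.010), ("L8", 0.005)]:
        queue.put(preterminal)
    queue.put(None)  # sentinel: no more work

def guess_generator(queue):
    """Expands each pre-terminal into concrete guesses (just prints here)."""
    while True:
        preterminal = queue.get()
        if preterminal is None:
            break
        structure, prob = preterminal
        print(f"expanding {structure} (p={prob})")

if __name__ == "__main__":
    work_queue = mp.Queue()
    producer = mp.Process(target=preterminal_producer, args=(work_queue,))
    consumer = mp.Process(target=guess_generator, args=(work_queue,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
```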

May 18th, 2016

I've been messing around with regenerating the priority queue with a max probability threshold. The issue I'm running into is that I want to support any generic PCFG with the core grammar, but that makes it very difficult to implement "regenerate with a max threshold" with any sort of reasonable performance. If anyone has any ideas I'm all ears. In the meantime I'm debating if I want to leave this feature out or if I want to set it up to work only with the specialized grammar in use right now. I'm not a fan of either one of those options.

May 10th, 2016

There are currently a number of features on my To-Do list for this project that I've variously been working on and procrastinating on.

Before I dive into them, I figure I should mention Saranga Komanduri's dissertation, which I'll be referencing a bit. He took the original PCFG code I wrote back in the day, completely re-wrote it, and modified it for CMU's password guessability service. You can find his dissertation here.

  • Probability Smoothing:

    Probability smoothing is a catch-all term that can mean different things. Usually when talking about it I'm focusing on expanding the number of guesses per rule that a grammar generates. Let me back up. With PCFGs I tend to throw around terms like pre-terminals and terminals, but what I'm really talking about are mangling rules and guesses. One strength of using PCFGs is their ability to model very targeted mangling rules. Rather than having a few hundred mangling rules like what you find in sets such as KoreLogic's rules for JtR, a typical PCFG generated by this program can produce trillions of rules. While this allows a cracker to make very fine grained guesses, it does come at a cost. Generating, sorting, and processing all those rules carries a huge overhead. Without any probability smoothing, the current test grammars I've been working with tend, on average, to generate fewer than 10 guesses per mangling rule. This means it runs very slowly. Ideally I'd like it to generate more guesses per rule. The loss of precision in most cases is more than offset by the increase in guessing speed. To put it another way, if you have a training set of a million passwords, and the number '5732' shows up twice, and '6207' shows up once, should they be assigned different probabilities, (and thus require different mangling rules), when numbers like '1234' show up tens of thousands of times?

    Now there are different ways of implementing probability smoothing. Komanduri's approach, (he called this problem quantization), was to use the Lloyd-Max algorithm. There are certainly benefits to that; the primary one is that it provides a hard limit on how many replacement groups are allowed for each non-terminal. I'm thinking of going with a much simpler approach which groups replacements that are within a set hard-limit probability of each other. My reasoning behind this is that I'm more concerned about the degree of difference the replacements have from each other vs. providing an upper limit on the total number of replacement groups. The nice thing, though, is that both approaches are applied in the training program, so regardless of which way you go, no changes have to be made to the password cracking program itself.

    The other feature I tend to munge in when talking about probability smoothing is the ability for the grammar to create replacements that were not seen in the training set. For example, even if the number '67932' was never seen in the training set, it would be nice to try it sometimes during a cracking session. There are many, many different approaches to solving this. Unfortunately, some of them will involve changing the cracking program. But I'm very excited about trying some of them out. For example, you can create a replacement that says "Run a Markov attack with threshold=X". In this way a PCFG attack becomes much like the Borg. It can assimilate other techniques and incorporate them into a larger cracking session.

  • Regenerating the Priority Queue with a max probability threshold

    Komanduri's approach to generating the PCFG rules was to generate all the rules that fell under a min probability threshold, save them to disk, and then sort them later. This is very similar to Markov Mode in JtR. It has a ton of advantages when you are not making changes to your grammar. Hey, it's not like you are using those terabyte hard drives for rainbow tables anymore! The problem is, if you are making frequent changes to your grammar, that's a lot of set-up time before you can start making guesses. Also it's hard to tell exactly how much disk space a given probability threshold will require until you start generating pre-terminals with it. This program takes a different approach of trying to generate the pre-terminals in priority order. The downside is that A) it's slower if you run the same grammar multiple times, and B) the memory/RAM requirements quickly become crazy for longer cracking sessions.

    One work-around I use is a dynamic minimum probability threshold: if the priority queue, (and related data-structures), grow too large, then a threshold is applied so low probability pre-terminals aren't saved, and this threshold can become more restrictive over time. The problem is once the cracker catches up to that threshold, the cracking session is over. I need a way to then rebuild the priority queue with a max probability threshold, (so that previous guesses aren't repeated). That will not only allow me to apply tighter memory bounds, but also save/restart cracking sessions, which is a requirement for making this tool actually usable. Unfortunately, since I'm trying to support any type of PCFG, (vs a very narrow and well defined grammar like my original implementation), it gets tricky to use a method like those described in Komanduri's paper to walk the grammar. Still, this is an important feature. As an added bonus, once the ability to walk the grammar with a min and max probability threshold is implemented, it becomes fairly straightforward to use it with a "store to disk and then sort" approach with the new grammar, or a pure "generate unsorted guesses like Markov mode" approach. The last option is particularly interesting because it opens up the opportunity to use PCFGs in GPU based crackers. Who says a PCFG approach has to be slow? ;p (A toy sketch of walking pre-terminals with both a min and a max bound appears after this list.)

  • Parallelize the rule generation code

    This is a very parallelizable problem. I need to start making use of all those CPUs.

  • Test and improve the basic grammar

    The whole idea behind the PCFG re-write was to make developing and testing new grammars to model human password creation habits easier. It would be nice to spend some time actually doing that vs working on the underlying tool itself. At the very least, there are a ton of improvements that can be applied to the training program.
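
Regarding the priority queue item above, here is a toy sketch (my own illustration, not the planned implementation) of walking pre-terminals with both a max and a min probability bound, so that a rebuilt queue skips anything that was already guessed in an earlier pass:

```python
# Toy illustration of a min/max probability window over pre-terminals.
# The real feature needs to walk the grammar itself; this just shows the
# bounds check and the "don't repeat earlier guesses" idea.
import heapq

def walk(preterminals, min_prob, max_prob):
    """preterminals: iterable of (probability, base structure) pairs.
    Yields only items with min_prob < probability <= max_prob, highest first."""
    heap = [(-p, s) for p, s in preterminals if min_prob < p <= max_prob]
    heapq.heapify(heap)
    while heap:
        neg_p, structure = heapq.heappop(heap)
        yield -neg_p, structure

session = [(0.20, "L8"), (0.05, "D3 L6"), (0.01, "L6 D4"), (0.001, "D10")]

# First pass: only keep pre-terminals above a dynamic floor of 0.04
print(list(walk(session, min_prob=0.04, max_prob=1.0)))

# Restart/rebuild: cover only the probability region not guessed yet
print(list(walk(session, min_prob=0.0, max_prob=0.04)))
```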
