#throwback to a couple of semesters ago. I had an assignment that was pretty much free rein; it just had to be remotely security related, and I decided to write a Perl script that solves mono-alphabetic substitution ciphers in most European languages (it’s on my GitHub). You know, the ones where each letter is replaced by another one.
The basic idea is that you have a key, which is just a permutation of the alphabet (keyed by a word in the pre-computing days), and each plaintext letter is replaced by the corresponding letter in the key to produce the ciphertext. For example, if the key were BCDE… (a deliberately simple one), the plaintext HELLO would become IFMMP. Easy, right?
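For concreteness, here’s a tiny Perl sketch of that substitution step (the sub name and the shift-by-one key are just for illustration; this isn’t lifted from the actual script):

```perl
use strict;
use warnings;

# Encrypt by mapping each plaintext letter to the letter at the same
# position in the key (a 26-letter permutation of the alphabet).
sub encrypt {
    my ($plaintext, $key) = @_;
    my %map;
    @map{ 'A' .. 'Z' } = split //, uc $key;   # A => 1st key letter, B => 2nd, ...
    return join '', map { $map{$_} // $_ } split //, uc $plaintext;
}

my $key = 'BCDEFGHIJKLMNOPQRSTUVWXYZA';        # the shift-by-one key from above
print encrypt('HELLO', $key), "\n";            # prints IFMMP
```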
But now imagine that the key is just a random permutation. The keyspace is 26!, which is about 4×10^26 possible keys. That’s a little too many to brute force, and it would be a pretty boring solution anyway. We can’t use a dictionary attack either, because we’re using random keys instead of words (with the rest of the alphabet filled in from what’s left), so where does that leave us?
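(If you want to sanity-check that number, a throwaway snippet does the trick:)

```perl
use bigint;   # exact integer arithmetic

# 26! = 403291461126605635584000000, roughly 4 x 10^26 possible keys
my $keys = 1;
$keys *= $_ for 2 .. 26;
print "$keys\n";
```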
Well, the cipher has a fatal flaw that we can exploit: we can tell how good a particular key is by looking at how close to English (or whatever language; the approach transfers easily) the decryption is. Combine that with the fact that the “Englishness” of a decryption improves as the key gets closer to the actual key (examples follow), and I can put that algorithms class to good use.
So, let’s take a look at those ideas using our HELLO example. First, we can tell how close to English a particular solution is by looking at its letter frequencies. Say one key we try gives us IFMMP, another gives us IFLLP, and another gives us IEMMP. Without knowing that we’re trying to get to HELLO, we can look at those solutions and say that a double LL is more common than a double MM, and that an E is more common than an F in English (I’ll get more into how we measure “Englishness” in a little bit). So the solutions that read more like English are closer to the correct solution. The reason is that we can change one part of the key without affecting (almost) any other part of the decryption, so we can iterate toward the correct solution by changing the key slightly and only keeping the changes that move us closer.
So, the way that we assess a solution’s “Englishness,” or fitness, is frequency analysis. We just need a table of letter frequencies in the English language; that lets us figure out how close to English our solution is by comparing the letter frequencies in our solution to those of English. We won’t just count all the letters in the solution and sort them, though: we’ll take the log probability of each letter’s frequency and add all of those (negative) log probabilities together, and the more likely the text, the higher the score.
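Here’s a monogram-only Perl sketch of that scoring, using the usual approximate English letter frequencies (the real thing uses n-grams, as described next, and none of these names come from the actual script):

```perl
use strict;
use warnings;

# Approximate English letter frequencies, in percent.
my %english_pct = (
    E => 12.7, T => 9.1,  A => 8.2,  O => 7.5,  I => 7.0,  N => 6.7,
    S => 6.3,  H => 6.1,  R => 6.0,  D => 4.3,  L => 4.0,  C => 2.8,
    U => 2.8,  M => 2.4,  W => 2.4,  F => 2.2,  G => 2.0,  Y => 2.0,
    P => 1.9,  B => 1.5,  V => 1.0,  K => 0.8,  J => 0.15, X => 0.15,
    Q => 0.10, Z => 0.07,
);
my %log_prob = map { $_ => log($english_pct{$_} / 100) } keys %english_pct;

# Sum the log probabilities of the letters; higher (less negative) = more English.
sub letter_fitness {
    my ($text) = @_;
    my $score = 0;
    $score += $log_prob{$_} for grep { exists $log_prob{$_} } split //, uc $text;
    return $score;
}

# IFLLP beats IFMMP because L is more common than M in English.
printf "IFMMP: %.2f  IFLLP: %.2f\n", letter_fitness('IFMMP'), letter_fitness('IFLLP');
```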
We’re using the log probabilities because they let us use addition instead of multiplication (log laws), which keeps the very small probabilities from underflowing and stops any single term from swamping the final fitness score. Now, how do we get this data? While I’d like to say I scraped corpora and tabulated the results, I just downloaded the files from someone who already had. I found files of frequency counts for various n-grams (sequences of multiple letters, instead of just single letters) in various languages. All that was left to do was convert all of that data into log probabilities, and then it was ready to use. I’ll get to why I used n-grams instead of just monograms (and digrams, as in the example above) shortly.
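As a sketch of what that conversion and scoring might look like (the file name and its one-“NGRAM count”-per-line format are assumptions about the downloaded data, and the floor value is just illustrative):

```perl
use strict;
use warnings;

# Load an n-gram count file and convert the counts into log probabilities.
sub load_ngrams {
    my ($file) = @_;
    my %count;
    my $total = 0;
    open my $fh, '<', $file or die "can't open $file: $!";
    while (<$fh>) {
        my ($gram, $n) = split;        # e.g. "TION 13168375"
        $count{uc $gram} = $n;
        $total += $n;
    }
    close $fh;
    return { map { $_ => log($count{$_} / $total) } keys %count };
}

# Slide a window of length $n over the text and add up the log probabilities.
# N-grams that never appeared in the corpus get a floor value instead of -infinity.
sub ngram_fitness {
    my ($text, $log_prob, $n) = @_;
    my $floor = log(1e-10);
    (my $clean = uc $text) =~ s/[^A-Z]//g;
    my $score = 0;
    $score += $log_prob->{ substr $clean, $_, $n } // $floor
        for 0 .. length($clean) - $n;
    return $score;
}

my $quadgrams = load_ngrams('english_quadgrams.txt');   # hypothetical filename
print ngram_fitness('HELLO WORLD', $quadgrams, 4), "\n";
```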
Now, to the actual algorithm. It mostly just boils down to guessing. Targeted guessing, but guessing nonetheless. First, we generate a random key and test its fitness. Then we swap two letters of the key and reassess its fitness: if it’s better, we keep the swap; otherwise we stick with what we had. Then we just do that a few thousand times and voilà. Well, not quite, but we have found a local maximum. So we can either take a look and finish it by hand if we need to, or we can go again with a different starting point and hopefully find a different local maximum (and eventually the global one).
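In Perl, a single hill-climbing run might look something like this sketch, reusing ngram_fitness from above (the sub names and the swap limit are mine, not necessarily how the actual script is organised):

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Decrypt by inverting the key: the key letter at position i maps back to
# the i-th letter of the alphabet.
sub decrypt {
    my ($ciphertext, $key) = @_;
    my %map;
    @map{ split //, $key } = ('A' .. 'Z');
    return join '', map { $map{$_} // $_ } split //, uc $ciphertext;
}

# One hill-climbing run from a random starting key.
sub hill_climb {
    my ($ciphertext, $log_prob, $n, $max_swaps) = @_;
    my @key  = shuffle('A' .. 'Z');
    my $best = ngram_fitness(decrypt($ciphertext, join('', @key)), $log_prob, $n);
    for (1 .. $max_swaps) {
        my ($i, $j) = (int rand 26, int rand 26);
        @key[$i, $j] = @key[$j, $i];              # try swapping two key letters
        my $score = ngram_fitness(decrypt($ciphertext, join('', @key)), $log_prob, $n);
        if   ($score > $best) { $best = $score }              # keep the improvement
        else                  { @key[$i, $j] = @key[$j, $i] } # undo the swap
    }
    return (join('', @key), $best);               # most likely a local maximum
}
```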
So, why use different n-grams? It just makes for a more accurate fitness. Any given English text (especially a short one) will have 1-gram frequencies that aren’t quite like English as a whole, so if we’re scoring only by 1-grams we won’t find the correct solution. But if we look at 2-, 3-, and 4-grams, things get a lot more accurate. In fact, after some experimentation I stopped looking at 1-grams entirely, since the results were more accurate without them.
I can’t claim originality (is anything original, though?). I found the general concepts of the fitness assessment and the hill-climbing algorithm applied to this problem after some research, and put some of those other findings into the final program, such as discounting 1-grams. I also learned that while 2-grams and 3-grams were effective, 4-grams and 5-grams had diminishing returns in accuracy compared to the added processing time and memory requirements.
This was useful information when it came to optimizing my script. I kept 4-grams as an option, to be used for short ciphertexts or when more accuracy was desired (instead of generating more starting points, or finishing by hand). Some other optimizations I implemented: breaking on a stale key, i.e. stopping after a few hundred swaps without progress instead of finishing the few thousand required to be comfortable with a solution, since we’re already sitting on a local maximum; setting a strategic cap on the number of swaps before stopping, based on some research as well as experimentation, and bearing in mind we can break early at a local maximum; setting a minimum log probability so that an unlikely (or non-existent) n-gram doesn’t halt progress, which was more a viability addition than an optimization, though the floor had to be chosen carefully to keep results accurate without affecting too many n-grams; and threads, although Perl’s threads have a lot of overhead, so they should only be used when generating a lot of starting points, such as with a short ciphertext. A rough sketch of that last one follows.
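This assumes the hill_climb sub from the earlier sketch, and the swap limit is made up; the stale-key break would live inside hill_climb itself, counting swaps since the last improvement and bailing out after a few hundred.

```perl
use strict;
use warnings;
use threads;

# Run several random restarts in parallel and keep the best-scoring key.
sub solve {
    my ($ciphertext, $log_prob, $n, $restarts) = @_;
    my @workers = map {
        threads->create(sub { hill_climb($ciphertext, $log_prob, $n, 2000) })
    } 1 .. $restarts;

    my ($best_key, $best_score);
    for my $t (@workers) {
        my ($key, $score) = $t->join;             # (key, fitness) from that run
        ($best_key, $best_score) = ($key, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best_key;
}
```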