Frequency Analysis

Introduction

The program shown at the bottom of the page on the arbitrary letter subsititution cipher shows a message being decoded by guessing the cipher alphabet one letter at a time. That was relatively simple to do because the word spacing was left in the cipher text. This meant that reasonable guesses at the shorter words would unlock letters that could be used for the longer words. This is much harder to do if the spaces are removed.

Frequency analysis is based on the known distribution of letters in the language of the message. The letter 'e' appears frequently in English. The letter that appears most frequently in the cipher text may well be the one that was substituted for the letter 'e' in the plain text message.

The following table and chart shows the relative frequencies of letters in English according to a handful of sources.

Relative Frequency Of Letters

In the following screenshot, you can see that we have an encrypted message with no word spacing.

Relative Frequency Of Letters

Analysis of the letters shows the following distribution of letters,

Relative Frequency Of Letters

This shows me that the letters 't' and 'u' appears most frequently. I'm going to make the 'u' an 'e' and the 't' a 't', just like the original graph. Now I have,

TXEDCAECJMYXAVKJTTXEFATTAMAZTXEDJEEAKTXEJCFSTCJCNBETTECYGFYSTSTGTSAKRSDXECYXAVYJMEYYJEEFESKEPERAPEPFNEGEYYSKETXERSDXECJBDXJFETAKEBETTECJTJTSMETXJTVJYCEBJTSWEBNYSMDBETAPAFERJGYETXEVACPYDJRSKEVJYBEZTSKTXERSDXECTELTTXSYMEJKTTXJTCEJYAKJFBEEGEYYEYJTTXEYXACTECVACPYVAGBPGKBARHBETTECYTXJTRAGBPFEGYEPZACTXEBAKEECVACPYTXSYSYMGRXXJCPECTAPASZTXEYDJREYJCECEMAWEPZCEIGEKRNJKJBNYSYSYFJYEPAKTXEHKAVKPSYTCSFGTSAKAZBETTECYSKTXEBJKEGJEEAZTXEMEYYJEETXEBETTECEJDDEJCYZCEIGEKTBNSKEKEBSYXTXEBETTECTXJTJDDEJCYMAYTZCEIGEKTBNSKTXERSDXECTELTMJNVEBBFETXEAKETXJTVJYYGFYTSTGTEPZACTXEBETTECESKTXEDBJSKTELTMEYYJEE

T-E at the start of the message suggests the word 'the'. I'm going to gamble on that here. The figure for 'H' is pretty close to the frequency of the letter 'X' in this message. I can also add a space where I think the word boundaries might go.

THE DCAECJMYHAVKJT THE FATTAMAZ THE DJEEAK THE JCFSTCJCNBETTECYGFYSTSTGTSAKRSDHECYHAVYJMEYYJEEFESKEPERAPEPFNEGEYYSKE THE RSDHECJBDHJFETAKEBETTECJTJTSMETHJTVJYCEBJTSWEBNYSMDBETAPAFERJGYE THE VACPYDJRSKEVJYBEZTSK THE RSDHECTELTTHSYMEJKTTHJTCEJYAKJFBEEGEYYEYJT THE YHACTECVACPYVAGBPGKBARHBETTECYTHJTRAGBPFEGYEPZAC THE BAKEECVACPYTHSYSYMGRHHJCPECTAPASZ THE YDJREYJCECEMAWEPZCEIGEKRNJKJBNYSYSYFJYEPAK THE HKAVKPSYTCSFGTSAKAZBETTECYSK THE BJKEGJEEAZ THE MEYYJEE THE BETTECEJDDEJCYZCEIGEKTBNSKEKEBSYH THE BETTECTHJTJDDEJCYMAYTZCEIGEKTBNSK THE RSDHECTELTMJNVEBBFE THE AKETHJTVJYYGFYTSTGTEPZAC THE BETTECESK THE DBJSKTELTMEYYJEE

At this point, I want to work out the positions of the next most frequent letter, 'A'. 'J' and 'Y' are the candidates from my analysis. I'm going with 'J' to give me,

THE DCAECAMYHAVK AT THE FATTAMAZ THE DAEEAK THE ACFSTCACNBETTECYGFYSTSTGTSAKRSDHECYHAVYAMEYYAEEFESKEPERAPEPFNEGEYYSKE THE RSDHECABDHAFETAKEBETTECATATSME THAT VAYCEBATSWEBNYSMDBETAPAFERAGYE THE VACPYDARSKEVAYBEZTSK THE RSDHECTELTTHSYMEAKT THAT CEAYAKAFBEEGEYYEYAT THE YHACTECVACPYVAGBPGKBARHBETTECY THAT RAGBPFEGYEPZAC THE BAKEECVACPYTHSYSYMGRHHACPECTAPASZ THE YDAREYACECEMAWEPZCEIGEKRNAKABNYSYSYFAYEPAK THE HKAVKPSYTCSFGTSAKAZBETTECYSK THE BAKEGAEEAZ THE MEYYAEE THE BETTECEADDEACYZCEIGEKTBNSKEKEBSYH THE BETTEC THAT ADDEACYMAYTZCEIGEKTBNSK THE RSDHECTELTMANVEBBFE THE AKE THAT VAYYGFYTSTGTEPZAC THE BETTECESK THE DBASKTELTMEYYAEE

Reasonable guesses at word boundaries are helping here. The word BETTEC is worth having a guess at - it appears elsewhere. I'm going to suggest that it is the word 'letter' and try out my theory.

THE DRAERAMYHAVKAT THE FATTAMAZ THE DAEEAK THE ARFSTRARN LETTER YGFYSTSTGTSAKRSDHERYHAVYAMEYYAEEFESKEPERAPEPFNEGEYYSKE THE RSDHERALDHAFETAKE LETTER ATATSME THAT VAYRELATSWELNYSMDLETAPAFERAGYE THE VARPYDARSKEVAYLEZTSK THE RSDHERTELTTHSYMEAKT THAT REAYAKAFLEEGEYYEYAT THE YHARTERVARPYVAGLPGKLARH LETTER Y THAT RAGLPFEGYEPZAR THE LAKEERVARPYTHSYSYMGRHHARPERTAPASZ THE YDAREYAREREMAWEPZREIGEKRNAKALNYSYSYFAYEPAK THE HKAVKPSYTRSFGTSAKAZ LETTER YSK THE LAKEGAEEAZ THE MEYYAEE THE LETTER EADDEARYZREIGEKTLNSKEKELSYH THE LETTER THAT ADDEARYMAYTZREIGEKTLNSK THE RSDHERTELTMANVELLFE THE AKE THAT VAYYGFYTSTGTEPZAR THE LETTER ESK THE DLASKTELTMEYYAEE

A one-letter word has emerged in the phrase 'LETTER Y THAT'. I've aleady found the 'A'. It could be an 'I' but that doesn't make a lot of sense. Checking my frequency table, my numbers suggest that the 'Y' could be an 'N' or an 'S' or an 'O'. The only thing that makes sense here is that the 'S' indicates a plural and I need to adjust the word boundaries. Adding that and we get,

THE DRAERAMSHAVK AT THE FATTAMAZ THE DAEEAK THE ARFSTRARN LETTERS GFSSTSTGTSAKRSDHERSHAVSAMESSAEEFESKEPERAPEPFNEGESSSKE THE RSDHERALDHAFETAKE LETTER ATATSME THAT VASRELATSWELNSSMDLETAPAFERAGSE THE VARPSDARSKEVASLEZTSK THE RSDHERTELTTHSSMEAKT THAT REASAKAFLEEGESSESAT THE SHARTERVARPSVAGLPGKLARH LETTERS THAT RAGLPFEGSEPZAR THE LAKEERVARPSTHSSSSMGRHHARPERTAPASZ THE SDARESAREREMAWEPZREIGEKRNAKALNSSSSSFASEPAK THE HKAVKPSSTRSFGTSAKAZ LETTERS SK THE LAKEGAEEAZ THE MESSAEE THE LETTER EADDEARSZREIGEKTLNSKEKELSSH THE LETTER THAT ADDEARSMASTZREIGEKTLNSK THE RSDHERTELTMANVELLFE THE AKE THAT VASSGFSTSTGTEPZAR THE LETTER ESK THE DLASKTELTMESSAEE

Getting closer. At the end of the message, we have the string 'MESSAEE'. This looks a lot like the word 'message'. Trying this out gives us,

THE DRAGRAMSHAVKAT THE FATTAMAZ THE DAGEAK THE ARFSTRARN LETTER SGFSSTSTGTSAKRSDHERSHAVSA MESSAGEFESKGPERAPEPFNGGESSSKG THE RSDHERALDHAFETAKE LETTER AT ATSME THAT VASRELATSWELNSSMDLETAPAFERAGSE THE VARPSDARSKGVASLEZTSK THE RSDHERTELTTHSSMEAKT THAT REASAKAFLE GGESSES AT THE SHARTER VARPSVAGLPGKLARH LETTERS THAT RAGLPFEGSEPZAR THE LAKGER VARPSTHSSSSMGRHHARPERTAPASZ THE SDARESAREREMAWEPZREIGEKRNAKALNSSSSSFASEPAK THE HKAVKPSSTRSFGTSAKAZ LETTER SSK THE LAKGGAGEAZ THE MESSAGE THE LETTER EADDEARSZREIGEKTLNSKEKGLSSH THE LETTER THAT ADDEARSMASTZREIGEKTLNSK THE RSDHERTELTMANVELLFE THE AKE THAT VASSGFSTSTGTEPZAR THE LETTER ESK THE DLASKTELT MESSAGE

From here on in, it's not too tricky to guess at letters from the context of the message. Frequency analysis supports the reasoning and offers you two or three letters to make reasonable guesses from.

Programming Frequency Analysis

The counting part is pretty easy. A wee chart would be a nice addition (see below)

INPUT message
upper ← UPPERCASE(message)
DECLARE letterCount[25]
total ← 0
FOR letter ← 0 TO upper.LENGTH - 1
   tmpASC ← ASCII CODE OF upper[letter]
   IF tmpASC >= 65 AND tmpASC <= 90 THEN
      letterCount[tmpASC-65] ← letterCount[tmpASC-65] + 1
      total ← total + 1
   END IF
END FOR
FOR count ← 0 TO 25
   OUTPUT letterCount[count]/total
END FOR

Relative Frequency Of Letters