http://qs1969.pair.com?node_id=43177

Leitchn has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a perl script to brute force a substitution cipher.

Before I can start though, I need letter freqency tables. Easy enough, you might say, but I need ones that contain uppercase letters, lowercase letters, and punctuation!

Does anyone have any idea where such may be found?

TIA.

Replies are listed 'Best First'.
(Ovid) Re: Letter frequencies
by Ovid (Cardinal) on Nov 24, 2000 at 03:51 UTC
    I don't know of any such resource, but perhaps writing a perl script to do figure that out would be the way to go? Have it read plenty of text in your target language and calculate the frequency of each symbol.

    You'd have to ensure that you're reading plaintext, though, and not markup or something like that. The following would populate %symbol with a frequency count. You'd just pass it a list of files on the command line. What you'd do with the data from there would be up to you.

    while ($line = <>) { $symbol{ $_ }++ for ( split //, $line ); }

    Cheers,
    Ovid

    Update: mdillon had a good point. Here's a rewrite:

    while ($line = <>) { for ( split //, $line ) { $symbol{ $_ }++; $total++; } }
    Or, the fun method: use my first code and add the following after the loop:
    $total = eval (join '+', values %symbol); # :)

    Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

      i would add a running total of all characters seen as well, to facilitate calculating the actual frequencies, not just the counts.

      update: what MJD said (infra).

        mdillon says:
        > i would add a running total of all characters seen as well
        While I agree with you that such a total would be useful, it's clearly much cheaper to compute it at the end:
        for my $symbol (keys %symbol) { $total += $symbol{$symbol}; }
Re: Letter frequencies
by Dominus (Parson) on Nov 24, 2000 at 03:53 UTC
    The way people make letter frequency tables is by taking a large sample of text and counting the number of occurrences of each letter.

    In the old days, this was tedious and time-consuming, so people used the same tables over and over.

    Nowadays, we have computers to do this for us. Just acquire a large sample of text and write a Perl program to count symbol frequencies. The Perl program will be about ten lines long. Then you can make sure that the sample text is similar the sorts of messages you are planning to decode, and you can arrange for the table to include exactly the items that you want it to.

    You can do even better by having it count trigraph frequencies (that's the frequency of a particular sequence of three characters, like har) and plugging that table into your brute-forcer instead.

      And here you go. I &heart programming excercises like this. Reads files, outputs a reverse sorted list based on # of occurences. Note that some trigraphs may not be totally valid because of the s/\W//g;. for example, "this is a line.This is another one" will yield "lineThis" as a word to be trigraphed. This is trivial to fix, though :)

      Update yes, it was.
      I changed
      s/ //g;
      to s/\W//g; thanks, Albannach! (wave)

      # # # use strict; my %symbol; my %tri; my @trikeys; my $line; my $ctr; print "Processing file...\n"; while ($line = <>) { for (split /\W/,$line) { #discard all non-alpha. be *greedy* s/\W//g; (length($_)>2) && $symbol{lc($_)}++; } } print "Collecting trigraphs...\n"; foreach (keys %symbol){ for ($ctr=0; $ctr <= (length($_)-3);$ctr++) { $tri{lc(substr ($_,$ctr,3))}+= $symbol{$_}; } } @trikeys = sort {$tri{$b} <=> $tri{$a}} keys %tri; print "Total Trigraphs : ",$#trikeys,"\n"; print "Trigraph\tCount\n"; foreach (@trikeys) { print "$_\t$tri{$_}\n"; }
        Cool. You have a tiny bug:
        print "Total Trigraphs : ",$#trikeys,"\n";
        This undercounts the trigraphs by 1.

        Also, the s/\W//g; line is not doing anything. All the \W characters have already been discarded by the split.

        I have a not-quite-brute-force decipherer, but it's not really what leitchn was looking for. It assumes that you still know where the word boundaries are, and then it uses a heuristically guided search based partly on letter frequency and partly on repeated letter patterns. For example, if it sees the ciphertext ABCDDEFGHIJA, it guesses that the word is either glassworking, stanniferous, or scaffoldings. (If it isn't one of these, it won't be able to solve the puzzle, because it doesn't know the words.)

        Hee's the program that generates the pattern dictionary:

        #!/usr/bin/perl @DICTS = </usr/dict/*>; # @DICTS = ('/usr/dict/words'); load_dictionary(@DICTS); { local $, = "\0"; while (($pat, $words) = each %words) { print $pat, @$words, "\n"; } } sub pattern { my ($w) = @_; my $n = 'A'; while (my ($l) = $w =~ /([a-z])/) { $w =~ s/$l/$n/g; $n++; } $w; } sub load_dictionary { my $s = time; local @ARGV = @_; while (<>) { chomp; next unless /^[a-z]*$/; next if $is_word{$_}++; push @{$words{pattern($_)}}, $_; } continue { my $n = keys %is_word; print STDERR "$n words loaded.\n" if $n % 10000 == 0 && $n > 0; } my $e = time - $s; print STDERR "Elapsed time to load dictionary: $e.\n"; }
        I'd really like to find a faster way to do the pattern() function.

Re: Letter frequencies
by dws (Chancellor) on Nov 24, 2000 at 10:20 UTC
    Many cryptanalysis texts will provide a sample frequency table. A good text will also advise that, for best advantage, a frequency table should be derived from a corpus (collection of documents) that are representative of the text you'll be using the frequency table to help decrypt. So, if you're going to mount a brute force attack on a simple substitution cipher, it'll help to derive your frequency table from a set of works that are representative of the text that's encrypted.
Re: Letter frequencies
by jeroenes (Priest) on Nov 24, 2000 at 14:54 UTC
    Just for fun, I want to post a script that does without the loops. So with map/ no while or for. What do you think?

    undef $/; #just read the whole STDIN in once: $_=<>; tr/ \t\n\r//d; #remove all spaces map{ $count{ $_ }++;} split //; #the actual count foreach $key (sort(keys %count)) #OK, one loop to print { print "$key\t$count{$key}\n"; $total += $count{$key}; } print "Total\t$total\n";
    Well, I tested it, and this is the tail of the output, run with time perl .... (too lazy for benchmarking):
    ó 18 ô 93 õ 12 ö 16 ÷ 12 ø 25 ù 2 ú 3 û 1 ü 2 ý 6 þ 2 ÿ 6 0.26user 0.05system 0:00.62elapsed 49%CPU (0avgtext+0avgdata 0maxresid +ent)k 0inputs+0outputs (275major+797minor)pagefaults 0swaps [jeroen@rulffk] time: 11:49:31 1 jobs wd: ~ (12.011Mb) $ ls -l test.txt -rw-rw-r-- 1 jeroen jeroen 150271 Aug 17 11:40 test.txt [jeroen@rulffk] time: 11:50:04 1 jobs wd: ~ (12.011Mb)
    And it looks to me like a 150kb file that has been counted in just 0.62 s. Looks pretty impressive to me. I don't consider myself to be a decent perl coder, so I would appreciate any comments to this....

    Have fun,

    Jeroen

    I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)

    PP: I just took a .txt file, but it wasn't a text-file after all. Maybe some word (ugh!) document or whatever.

    Update Removed square brackets from tr///d, as pointed out by merlyn. And changed a funny typo (s/while/whole/). Justifies my remark on decent coding, I'm afraid :-<. Another update: added the total number.

      tr/[ \t\n\r]//d;
      Are you literally wanting to remove [ and ] there? Remember, the operands to the tr operator are not character classes, so they don't need brackets.

      -- Randal L. Schwartz, Perl hacker

        Stomp on the head Yeah, of course. My excuses. I still have to get used to tr// . At first I wanted to tr/\s//d, but that was quite a disappointment! ;-).

        Thanks a lot!
        Jeroen

        I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)

Re: Letter frequencies
by 2501 (Pilgrim) on Nov 24, 2000 at 10:51 UTC
    Take a look at the British National Corpus (BNC).
    I became aware of them after reading TPJ.

    http://sara.natcorp.ox.ac.uk/ will take you to the BNC Online area.
    good luck!