in reply to Letter frequencies

The way people make letter frequency tables is by taking a large sample of text and counting the number of occurrences of each letter.

In the old days, this was tedious and time-consuming, so people used the same tables over and over.

Nowadays, we have computers to do this for us. Just acquire a large sample of text and write a Perl program to count symbol frequencies. The Perl program will be about ten lines long. Then you can make sure that the sample text is similar to the sorts of messages you are planning to decode, and you can arrange for the table to include exactly the items that you want it to.
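
A minimal sketch of such a counting program follows. It assumes that only the ASCII letters a-z matter and that case should be ignored; the name add_letter_counts is made up for illustration, and you would change the regex to count whatever symbols you care about.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Add the letters found in $text to the table %$count, ignoring case.
# Assumption: only ASCII letters a-z are of interest.
sub add_letter_counts {
    my ($count, $text) = @_;
    $count->{$_}++ for lc($text) =~ /[a-z]/g;
    return $count;
}

# Only read input when sample files are named on the command line.
if (@ARGV) {
    my %count;
    while (my $line = <>) {
        add_letter_counts(\%count, $line);
    }
    my $total = 0;
    $total += $_ for values %count;
    # Most frequent letter first; ties are broken alphabetically.
    for my $letter (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        printf "%s\t%d\t%.2f%%\n", $letter, $count{$letter},
            100 * $count{$letter} / $total;
    }
}
```

Run it with one or more sample files as arguments to get a table of letter, count, and percentage.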

You can do even better by having it count trigraph frequencies (that's the frequency of a particular sequence of three characters, like har) and plugging that table into your brute-forcer instead.
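
The trigraph version can be sketched along the same lines. This is only an illustration under stated assumptions: a word is taken to be a maximal run of ASCII letters, trigraphs never span a word boundary, and the name add_trigraph_counts is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Add the trigraphs of every word in $text to the table %$tri.
# Assumption: a word is a maximal run of ASCII letters, so no
# trigraph spans a space or a punctuation mark.
sub add_trigraph_counts {
    my ($tri, $text) = @_;
    for my $word (lc($text) =~ /[a-z]+/g) {
        # The range is empty for words shorter than three letters.
        for my $i (0 .. length($word) - 3) {
            $tri->{ substr($word, $i, 3) }++;
        }
    }
    return $tri;
}

# Only read input when sample files are named on the command line.
if (@ARGV) {
    my %tri;
    while (my $line = <>) {
        add_trigraph_counts(\%tri, $line);
    }
    my @keys = sort { $tri{$b} <=> $tri{$a} || $a cmp $b } keys %tri;
    print "Distinct trigraphs: ", scalar(@keys), "\n";
    print "$_\t$tri{$_}\n" for @keys;
}
```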

Re: Re: Letter frequencies
by boo_radley (Parson) on Nov 24, 2000 at 08:57 UTC
    And here you go. I ♥ programming exercises like this. It reads files and outputs a list sorted in descending order of occurrence count. Note that some trigraphs may not be totally valid because of the s/\W//g;. For example, "this is a line.This is another one" will yield "lineThis" as a word to be trigraphed. This is trivial to fix, though :)

    Update: yes, it was. I changed s/ //g; to s/\W//g;. Thanks, Albannach! (wave)

    use strict;
    my %symbol;
    my %tri;
    my @trikeys;
    my $line;
    my $ctr;

    print "Processing file...\n";
    while ($line = <>) {
        for (split /\W/,$line) {
            #discard all non-alpha. be *greedy*
            s/\W//g;
            (length($_)>2) && $symbol{lc($_)}++;
        }
    }
    print "Collecting trigraphs...\n";
    foreach (keys %symbol){
        for ($ctr=0; $ctr <= (length($_)-3);$ctr++) {
            $tri{lc(substr ($_,$ctr,3))}+= $symbol{$_};
        }
    }
    @trikeys = sort {$tri{$b} <=> $tri{$a}} keys %tri;
    print "Total Trigraphs : ",$#trikeys,"\n";
    print "Trigraph\tCount\n";
    foreach (@trikeys) {
        print "$_\t$tri{$_}\n";
    }

      Cool. You have a tiny bug:
      print "Total Trigraphs : ",$#trikeys,"\n";
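      This undercounts the trigraphs by 1: $#trikeys is the index of the last element, not the number of elements. scalar(@trikeys) gives the count.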

      Also, the s/\W//g; line is not doing anything. All the \W characters have already been discarded by the split.

      I have a not-quite-brute-force decipherer, but it's not really what leitchn was looking for. It assumes that you still know where the word boundaries are, and then it uses a heuristically guided search based partly on letter frequency and partly on repeated letter patterns. For example, if it sees the ciphertext ABCDDEFGHIJA, it guesses that the word is either glassworking, stanniferous, or scaffoldings. (If it isn't one of these, it won't be able to solve the puzzle, because it doesn't know the words.)

      Here's the program that generates the pattern dictionary:

      #!/usr/bin/perl

      @DICTS = </usr/dict/*>;
      # @DICTS = ('/usr/dict/words');
      load_dictionary(@DICTS);

      {
        local $, = "\0";
        while (($pat, $words) = each %words) {
          print $pat, @$words, "\n";
        }
      }

      sub pattern {
        my ($w) = @_;
        my $n = 'A';
        while (my ($l) = $w =~ /([a-z])/) {
          $w =~ s/$l/$n/g;
          $n++;
        }
        $w;
      }

      sub load_dictionary {
        my $s = time;
        local @ARGV = @_;
        while (<>) {
          chomp;
          next unless /^[a-z]*$/;
          next if $is_word{$_}++;
          push @{$words{pattern($_)}}, $_;
        }
        continue {
          my $n = keys %is_word;
          print STDERR "$n words loaded.\n" if $n % 10000 == 0 && $n > 0;
        }
        my $e = time - $s;
        print STDERR "Elapsed time to load dictionary: $e.\n";
      }

      I'd really like to find a faster way to do the pattern() function.
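
One possible alternative for pattern(), offered only as a sketch: walk the word once and look each letter up in a hash, instead of running one global substitution per distinct letter. The name pattern_single_pass is made up here, the input is assumed to contain only lowercase a-z (as the dictionary loader already guarantees), and it has not been benchmarked, so whether it is actually faster is an open question.

```perl
use strict;
use warnings;

# Map a lowercase word to its repeated-letter pattern: the first
# distinct letter becomes A, the second B, and so on.
# Assumption: $w contains only the letters a-z.
sub pattern_single_pass {
    my ($w) = @_;
    my %seen;
    my $next = 'A';
    # ||= is safe here because every assigned label is a true value.
    return join '', map { $seen{$_} ||= $next++ } split //, $w;
}

print pattern_single_pass('glassworking'), "\n";
```

For words made only of a-z it should return the same string as pattern(), so it could be dropped into load_dictionary and timed against the original.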