Re: Re: Letter frequencies

by boo_radley (Parson)
on Nov 24, 2000 at 08:57 UTC

in reply to Re: Letter frequencies
in thread Letter frequencies

And here you go. I ♥ programming exercises like this. It reads files and outputs a reverse-sorted list based on the number of occurrences. Note that some trigraphs may not be totally valid because of the s/\W//g; for example, "this is a line.This is another one" will yield "lineThis" as a word to be trigraphed. This is trivial to fix, though :)

Update: yes, it was. I changed s/ //g; to s/\W//g; thanks, Albannach! (wave)

use strict;

my %symbol;
my %tri;
my @trikeys;
my $line;
my $ctr;

print "Processing file...\n";
while ($line = <>) {
    for (split /\W/,$line) {
        #discard all non-alpha. be *greedy*
        s/\W//g;
        (length($_)>2) && $symbol{lc($_)}++;
    }
}

print "Collecting trigraphs...\n";
foreach (keys %symbol){
    for ($ctr=0; $ctr <= (length($_)-3);$ctr++) {
        $tri{lc(substr ($_,$ctr,3))}+= $symbol{$_};
    }
}

@trikeys = sort {$tri{$b} <=> $tri{$a}} keys %tri;
print "Total Trigraphs : ",$#trikeys,"\n";
print "Trigraph\tCount\n";
foreach (@trikeys) {
    print "$_\t$tri{$_}\n";
}
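
For trying the idea without a file, here is a minimal sketch that does the same kind of counting on a string. The helper name trigraph_counts and the sample sentence are only for illustration (the sentence is borrowed from the note above), and it counts each word occurrence directly instead of going through a word table, which should give the same totals.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post:
# returns a hash of trigraph => count for the given text.
sub trigraph_counts {
    my ($text) = @_;
    my %tri;
    # Split on runs of non-word characters and ignore words shorter than 3.
    for my $word (grep { length($_) > 2 } split /\W+/, lc $text) {
        # Slide a 3-character window across the word.
        $tri{ substr($word, $_, 3) }++ for 0 .. length($word) - 3;
    }
    return %tri;
}

my %tri = trigraph_counts("this is a line.This is another one");

# Most frequent first; ties broken alphabetically so the order is stable.
for my $t (sort { $tri{$b} <=> $tri{$a} || $a cmp $b } keys %tri) {
    print "$t\t$tri{$t}\n";
}
print "Total trigraphs: ", scalar(keys %tri), "\n";
```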

Re: Letter frequencies
by Dominus (Parson) on Nov 24, 2000 at 17:29 UTC
    Cool. You have a tiny bug:
    print "Total Trigraphs : ",$#trikeys,"\n";
    This undercounts the trigraphs by 1.
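
A small sketch of why that happens: $#array is the index of the last element, while an array in scalar context gives the number of elements. The three sample keys are made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample keys, only for illustration.
my @trikeys = qw(the ing and);

my $last_index = $#trikeys;         # index of the last element
my $count      = scalar @trikeys;   # number of elements

print "Last index      : $last_index\n";
print "Total Trigraphs : $count\n";
```

So printing scalar(@trikeys) instead of $#trikeys is one way to get the real total.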

    Also, the s/\W//g; line is not doing anything. All the \W characters have already been discarded by the split.
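
A quick way to check this, as a sketch with a sample line of my choosing: split on /\W/ and look for any field that still contains a \W character.

```perl
use strict;
use warnings;

# Sample input, chosen only for the demonstration.
my $line   = "this is a line.This is another one\n";
my @fields = split /\W/, $line;

# Collect any field that still contains a non-word character.
my @dirty = grep { /\W/ } @fields;

print scalar(@dirty), " fields still contain a \\W character\n";
```

Since every \W character is consumed as a separator, none can survive inside a field, so the later s/\W//g; has nothing left to remove.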

    I have a not-quite-brute-force decipherer, but it's not really what leitchn was looking for. It assumes that you still know where the word boundaries are, and then it uses a heuristically guided search based partly on letter frequency and partly on repeated letter patterns. For example, if it sees the ciphertext ABCDDEFGHIJA, it guesses that the word is either glassworking, stanniferous, or scaffoldings. (If it isn't one of these, it won't be able to solve the puzzle, because it doesn't know the words.)
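
To make the pattern idea concrete, here is a rough sketch of the lookup step. It is not the decipherer itself: the pattern() sub is the same normalization used by the dictionary generator in this post, while the tiny word list and the candidates() helper are assumptions added for illustration.

```perl
use strict;
use warnings;

# Same normalization as in the dictionary generator:
# the first distinct letter becomes A, the second B, and so on.
sub pattern {
    my ($w) = @_;
    my $n = 'A';
    while (my ($l) = $w =~ /([a-z])/) {
        $w =~ s/$l/$n/g;
        $n++;
    }
    return $w;
}

# Hypothetical miniature word list standing in for a real dictionary file.
my @dictionary = qw(glassworking stanniferous scaffoldings letter frequency);

# Index the words by their letter pattern.
my %words;
push @{ $words{ pattern($_) } }, $_ for @dictionary;

# Hypothetical helper: all known words sharing the cipherword's pattern.
sub candidates {
    my ($cipherword) = @_;
    return @{ $words{ pattern(lc $cipherword) } || [] };
}

print join(', ', candidates('ABCDDEFGHIJA')), "\n";
```

A cipherword whose pattern is not in the table returns an empty list, which is the "it doesn't know the words" case.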

    Here's the program that generates the pattern dictionary:

    #!/usr/bin/perl

    @DICTS = </usr/dict/*>;
    # @DICTS = ('/usr/dict/words');
    load_dictionary(@DICTS);

    {
      local $, = "\0";
      while (($pat, $words) = each %words) {
        print $pat, @$words, "\n";
      }
    }

    sub pattern {
      my ($w) = @_;
      my $n = 'A';
      while (my ($l) = $w =~ /([a-z])/) {
        $w =~ s/$l/$n/g;
        $n++;
      }
      $w;
    }

    sub load_dictionary {
      my $s = time;
      local @ARGV = @_;
      while (<>) {
        chomp;
        next unless /^[a-z]*$/;
        next if $is_word{$_}++;
        push @{$words{pattern($_)}}, $_;
      }
      continue {
        my $n = keys %is_word;
        print STDERR "$n words loaded.\n" if $n % 10000 == 0 && $n > 0;
      }
      my $e = time - $s;
      print STDERR "Elapsed time to load dictionary: $e.\n";
    }
    I'd really like to find a faster way to do the pattern() function.
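
One direction that might help, offered as an untested-for-speed sketch: walk the word once and remember the label given to each letter in a hash, instead of running one regex substitution per distinct letter. The name pattern_single_pass is made up here, and the sub assumes its input contains only lowercase a-z, which the dictionary loader already enforces. Whether it is actually faster would need benchmarking.

```perl
use strict;
use warnings;

# Hypothetical alternative to pattern(); assumes input matches /^[a-z]*$/.
sub pattern_single_pass {
    my ($w) = @_;
    my %label;        # letter => pattern symbol already assigned
    my $next = 'A';   # next unused symbol (string increment: A, B, C, ...)
    my $out  = '';
    for my $l (split //, $w) {
        # First time we see a letter, give it the next symbol.
        $label{$l} = $next++ unless exists $label{$l};
        $out .= $label{$l};
    }
    return $out;
}

print pattern_single_pass($_), "\t$_\n"
    for qw(glassworking stanniferous scaffoldings);
```

All three words should come out with the same pattern as in the ABCDDEFGHIJA example above.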
