One way to start solving a simple-substitution cryptogram is to find the vowels. I look for high-frequency letters that rarely contact (appear next to) each other. (I've seen other approaches to cryptograms here, like merlyn's pat program, but nothing for vowels.)

I would like to automate the process of finding vowels, and it seems like a tree cluster analysis algorithm might work, as described on this page.

The distance measure is a little tricky. I would like the distance from letter 'x' to letter 'y' to be a percent disagreement, perhaps the number of times 'x' contacts 'y' divided by the total number of contacts for 'x' and 'y'. The more contacts, the greater the distance, since I am looking for letters that avoid each other.
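A minimal sketch of one such measure, under assumptions of my own: contacts are counted only inside words, a contact is unordered ('an' and 'na' both count for the pair a/n), and the denominator is the number of contacts that involve either letter. The names contact_counts and contact_distance are hypothetical helpers, not from any module.

```perl
use strict;
use warnings;

# Count adjacent-letter contacts within each word.
# Returns (\%pair, \%total): $pair{x}{y} is the number of x/y adjacencies,
# $total{x} is the number of adjacencies that letter x takes part in.
sub contact_counts {
    my ($text) = @_;
    my (%pair, %total);
    for my $word (split ' ', lc $text) {
        my @c = grep { /[a-z]/ } split //, $word;
        for my $i (0 .. $#c - 1) {
            my ($x, $y) = @c[$i, $i + 1];
            $pair{$x}{$y}++;
            $pair{$y}{$x}++ if $x ne $y;   # keep the table symmetric
            $total{$x}++;
            $total{$y}++;
        }
    }
    return (\%pair, \%total);
}

# Assumed distance: x/y contacts divided by contacts involving x or y.
# 0 means the two letters never touch; 1 means they touch only each other.
sub contact_distance {
    my ($pair, $total, $x, $y) = @_;
    my $xy    = ($pair->{$x} && $pair->{$x}{$y}) || 0;
    my $denom = ($total->{$x} || 0) + ($total->{$y} || 0) - $xy;
    return $denom ? $xy / $denom : 0;
}

my ($pair, $total) = contact_counts("banana");
printf "%.2f\n", contact_distance($pair, $total, 'a', 'n');   # prints 0.80
```

With this definition, letters that avoid each other (as vowels tend to) get small distances, so they should end up in the same cluster.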

I took a look at Algorithm::Cluster, but it doesn't seem to be directly applicable to this case. It's more for genetic data with real-number values.
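Since the distance here is defined directly between pairs of letters, a small hand-rolled agglomerative clustering over a precomputed distance table might be enough to experiment with. The sketch below uses average linkage, which is my own choice and may not match the method on the page mentioned above; cluster_tree and linkage are hypothetical names, and the distances in the usage lines are made up for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Average distance between the members of two clusters.
sub linkage {
    my ($dist, $p, $q) = @_;
    my ($sum, $n) = (0, 0);
    for my $x (@{ $p->{members} }) {
        for my $y (@{ $q->{members} }) {
            $sum += $dist->{$x}{$y};
            $n++;
        }
    }
    return $sum / $n;
}

# Repeatedly merge the two closest clusters until one remains.
# $dist->{$x}{$y} must be defined for every pair of distinct items, in both orders.
# Returns a nested tree: a leaf is an item, a node is [left, right, merge_distance].
sub cluster_tree {
    my ($dist, @items) = @_;
    my @clusters = map { +{ members => [$_], tree => $_ } } sort @items;
    while (@clusters > 1) {
        my ($best, $bi, $bj);
        for my $i (0 .. $#clusters - 1) {
            for my $j ($i + 1 .. $#clusters) {
                my $d = linkage($dist, $clusters[$i], $clusters[$j]);
                ($best, $bi, $bj) = ($d, $i, $j) if !defined $best || $d < $best;
            }
        }
        my ($cj) = splice @clusters, $bj, 1;   # remove the higher index first
        my ($ci) = splice @clusters, $bi, 1;
        push @clusters, {
            members => [ @{ $ci->{members} }, @{ $cj->{members} } ],
            tree    => [ $ci->{tree}, $cj->{tree}, $best ],
        };
    }
    return $clusters[0]{tree};
}

# Illustrative (invented) distances: 'a' and 'e' avoid each other, as do 'n' and 't'.
my %dist;
for my $p (['a','e',0.1], ['n','t',0.2], ['a','n',0.8], ['a','t',0.8], ['e','n',0.8], ['e','t',0.8]) {
    my ($x, $y, $d) = @$p;
    $dist{$x}{$y} = $dist{$y}{$x} = $d;
}
print Dumper(cluster_tree(\%dist, qw(a e n t)));
```

The search for the closest pair is quadratic on every pass, which should be fine for a 26-letter alphabet.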

Here is my starting code (so far it just gets the single-letter and digram frequencies). Any suggestions on modules I could use to help solve this?

use strict;
use warnings;
use Statistics::Frequency;
use FileHandle;
use Data::Dumper;

sub simplifyText {
   my $txt = shift;
   $txt =~ s/\s+/ /g;
   $txt =~ tr/A-Z/a-z/;
   $txt =~ tr[.(),/:][]d;
   return $txt;
}

my $f1  = Statistics::Frequency->new;

my $fn = "/net/fox/vol02/tallman/notes/dynamac";
my $fh = FileHandle->new($fn, "r");   # method-call form instead of indirect-object "new FileHandle"
defined $fh or die "Cannot open $fn: $!\n";
local $/ = undef;

my $text = <$fh>;
$text = simplifyText($text);
print "text *$text*\n";

my @txt = split //,$text;
my @txt_nospaces = grep { $_ ne ' ' } @txt;
$f1->add_data(\@txt_nospaces);

my $f2 = Statistics::Frequency->new;
my $last = undef;
my $letter;
my @pairs = ();
foreach $letter (@txt) {
   if ($letter eq ' ') {
      $last = undef;
      next;
   }
   push @pairs,($last . $letter) if defined $last;
   $last = $letter;
}
$f2->add_data(\@pairs);

my %freq = $f1->frequencies;
print Data::Dumper->Dump([\%freq],["*freq"]);

my %freq2 = $f2->frequencies;
print Data::Dumper->Dump([\%freq2],["*freq2"]);

In reply to Finding vowels in a cryptogram by tall_man
