in reply to a question about making a word frequency matrix

A solution on which you can build:

# Returns the smaller of two numbers.
sub min2 { $_[0] < $_[1] ? $_[0] : $_[1] }

my $filename = ...;

# Words to leave out of the counts.
my %ignore = map { lc($_) => 1 } qw(
    a an are i if in is it m on re s to the ...
);

open(my $fh_in, '<', $filename)
    or die("Unable to open input file: $!\n");

my %counts;
while (<$fh_in>) {
    $_ = lc($_);
    while (/([a-z]+)/g) {
        next if $ignore{$1};
        ++$counts{$1};
    }
}

# Most frequent words first.
my @words_ordered = sort { $counts{$b} <=> $counts{$a} } keys %counts;

# Report at most the 100 most frequent words.
foreach (0 .. min2(99, $#words_ordered)) {
    my $word = $words_ordered[$_];
    print("Word $word was found $counts{$word} times.\n");
}

It's simplistic! For example, "It's Jeff" is broken down into "it", "s" and "jeff".
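One way to soften that limitation is to let the pattern keep apostrophes that sit between letters. The following is only a sketch: `tokenize` is a hypothetical helper, not part of the code above, and it still ignores hyphens, digits and non-ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into lowercase words, keeping
# apostrophes that sit between letters ("it's", "don't").
sub tokenize {
    my ($line) = @_;
    # In list context, m//g returns every captured match.
    return lc($line) =~ /([a-z]+(?:'[a-z]+)*)/g;
}

my @words = tokenize("It's Jeff");
print "@words\n";    # prints: it's jeff
```

With a pattern like this, the fragments "s", "m" and "re" should no longer show up on their own, but contractions such as "it's" would have to be added to %ignore instead.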

Update: Added %ignore. The list needs work. Maybe we could automatically ignore words shorter than 4 letters unless they were originally uppercase.
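A rough sketch of that idea, under the assumption that "originally uppercase" means the whole word was written in capitals (an acronym such as "CPU"). `count_words` and its `$min_len` parameter are made up for this illustration. The word is matched before lowercasing so its original case can still be checked.

```perl
use strict;
use warnings;

# Hypothetical helper: count words in a list of lines, skipping words
# shorter than $min_len unless they were written entirely in uppercase.
sub count_words {
    my ($min_len, @lines) = @_;
    my %counts;
    for my $line (@lines) {
        # Match on the original text so the case is still visible.
        while ($line =~ /([A-Za-z]+)/g) {
            my $word = $1;
            next if length($word) < $min_len && $word ne uc($word);
            ++$counts{ lc($word) };
        }
    }
    return \%counts;
}

my $counts = count_words(4, "The CPU is hot", "the cpu fan is loud");
print "$_: $counts->{$_}\n" for sort keys %$counts;
# prints:
# cpu: 1
# loud: 1
```

Note that one-letter capitals such as "I" or a sentence-initial "A" pass this test too, so a small %ignore list would probably still be needed alongside it.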