I mentioned in Re: Most common substring that I had a snippet to scan a text and report the most common substrings. This is useful in language analysis problems as well as some forms of encryption and compression.

This snippet scans an input and counts occurrences of substrings. For the input sentence:

off the record, heretofore, the officer found that in the theater of war, one hath need of a weatherproof theory of games

The output would begin:

8<th>
7<of>
7<he>
6<the>
6< th>
6< t>
5< o>
5<f >
5< the>
4<at>
4<of >
4<e >
4<er>
4< of>

The report format is similar to files included in the Moby Lexicon Project.

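#!/usr/bin/perl

# Write a substring analysis of an input text, like Moby's sample.
# All whitespace is considered a single space.

use strict;
use Getopt::Long;

my %Substrings;

my $Minimum = 2;     # report only substrings seen at least this often
my $Shortest = 2;    # shortest substring length to count
my $Longest = 5;     # longest substring length to count
my $Limit = 500;     # maximum number of report lines (0 means no limit)

GetOptions('minimum=i' => \$Minimum,
           'shortest=i' => \$Shortest,
           'longest=i' => \$Longest,
           'limit=i' => \$Limit);

exit(main(@ARGV));

sub main
{
    my $input;

    do { local $/ = undef; $input = <>; };

    # Fold newlines, then all runs of whitespace, into single spaces.
    $input =~ s/\n/ /gs;
    $input =~ s/\s+/ /gs;

    # Count every substring of each length from $Shortest to $Longest.
    for my $span ($Shortest .. $Longest)
    {
        for my $pos (0 .. length($input)-$span)
        {
            $Substrings{ substr($input, $pos, $span) }++;
        }
    }

    # Report the most frequent substrings first, up to $Limit lines.
    my $count = 0;
    foreach (grep { not $Limit or $count++ < $Limit }
             grep { $Substrings{$_} >= $Minimum }
             sort { $Substrings{$b} <=> $Substrings{$a} }
             keys %Substrings)
    {
        print $Substrings{$_}, '<', $_, '>', "\n";
    }

    # A sub ending in a loop has an unspecified return value,
    # so return the exit status explicitly.
    return 0;
}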
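
The counting loop in the script above can also be pulled into a function, which makes it easier to test on small inputs. The following is only a sketch under stated assumptions: the names count_substrings and report_lines are hypothetical and not part of the original snippet, and ties are broken alphabetically here so the order is repeatable, whereas the original leaves ties in hash order.

```perl
use strict;
use warnings;

# Hypothetical helper: count every substring of $text whose length lies
# between $shortest and $longest. Returns a hash ref: substring => count.
sub count_substrings {
    my ($text, $shortest, $longest) = @_;
    my %count;
    $text =~ s/\s+/ /g;    # all whitespace is considered a single space
    for my $span ($shortest .. $longest) {
        # The range is empty when $span exceeds the text length.
        for my $pos (0 .. length($text) - $span) {
            $count{ substr($text, $pos, $span) }++;
        }
    }
    return \%count;
}

# Hypothetical helper: format "count<substring>" lines, most frequent first.
# Ties are sorted alphabetically (an assumption, for repeatable output).
sub report_lines {
    my ($count, $minimum) = @_;
    return map  { "$count->{$_}<$_>" }
           sort { $count->{$b} <=> $count->{$a} or $a cmp $b }
           grep { $count->{$_} >= $minimum }
           keys %$count;
}

my $sentence = 'off the record, heretofore, the officer found that in '
             . 'the theater of war, one hath need of a weatherproof '
             . 'theory of games';

my $counts = count_substrings($sentence, 2, 5);
print "$_\n" for report_lines($counts, 6);
```

Run on the sample sentence, this counts 8 occurrences of "th" and 7 of "of", matching the first lines of the report shown earlier; the order of the remaining entries may differ from the original wherever counts tie.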

Re: count moby substrings
by mildside (Friar) on Jul 30, 2003 at 06:46 UTC
    Hi hally. I'm trying to work out why you used the do construct on line 27. Why is it needed, as opposed to a bare block with no do? I understand that do can be used with while, or to override the loop-like nature of bare blocks when used with next or last, but neither is the case here.

    Cheers!

      Just a matter of style-- bare braces in the middle of nowhere make me itch, looking for an 'if' line that may have been deleted or obscured. Same reason I explicitly handle newlines and whitespace in two passes, though that's not really necessary.

      --
      [ e d @ h a l l e y . c c ]

        I prefer to write it this way:
        my $input = do { local $/; <> };

        Makeshifts last the longest.
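
The three slurp styles discussed in this thread can be compared side by side. This is a minimal sketch, not anyone's posted code: the sub names and the in-memory filehandle are assumptions made for illustration, and a lexical handle stands in for the <> used in the script. All three localize $/ so that a single read returns the whole input, and the old value of $/ comes back when the block exits.

```perl
use strict;
use warnings;

# Hypothetical helper: open a read handle on an in-memory string,
# standing in for the files that <> would read.
sub handle_on {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    return $fh;
}

# Style 1: a bare block, as asked about in the first reply.
sub slurp_bare_block {
    my ($fh) = @_;
    my $input;
    { local $/ = undef; $input = <$fh>; }
    return $input;
}

# Style 2: do BLOCK used as a statement, as in the original script.
sub slurp_do_statement {
    my ($fh) = @_;
    my $input;
    do { local $/ = undef; $input = <$fh>; };
    return $input;
}

# Style 3: do BLOCK used as an expression, as in the last reply.
# The scalar assignment makes the readline return one string.
sub slurp_do_expression {
    my ($fh) = @_;
    my $input = do { local $/; <$fh> };
    return $input;
}

my $sample = "first line\nsecond line\n";
for my $slurp (\&slurp_bare_block, \&slurp_do_statement, \&slurp_do_expression) {
    my $got = $slurp->(handle_on($sample));
    print length($got), " characters read\n";
}
```

Since all three read the same text, the choice between them is a matter of style, as the replies above say.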