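This snippet scans its input and counts how often each substring occurs. By default it considers substrings of 2 to 5 characters and reports those seen at least twice. For the input sentence: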
off the record, heretofore, the officer found that in the theater of war, one hath need of a weatherproof theory of games
The output would begin:
8<th> 7<of> 7<he> 6<the> 6< th> 6< t> 5< o> 5<f > 5< the> 4<at> 4<of > 4<e > 4<er> 4< of>
The report format is similar to files included in the Moby Lexicon Project.
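    #!/usr/bin/perl
    # Write a substring analysis of an input text, like Moby's sample.
    # All whitespace is considered a single space.

    use strict;
    use Getopt::Long;

    my %Substrings;
    my $Minimum  = 2;    # report only substrings seen at least this often
    my $Shortest = 2;    # shortest substring length to count
    my $Longest  = 5;    # longest substring length to count
    my $Limit    = 500;  # maximum number of lines to print; 0 means no limit

    GetOptions('minimum=i'  => \$Minimum,
               'shortest=i' => \$Shortest,
               'longest=i'  => \$Longest,
               'limit=i'    => \$Limit);

    exit(main(@ARGV));

    sub main {
        # Slurp all input (named files or STDIN) into one string.
        my $input;
        do { local $/ = undef; $input = <>; };

        # Collapse every run of whitespace, newlines included, to one space.
        $input =~ s/\n/ /gs;
        $input =~ s/\s+/ /gs;

        # Slide a window of each length across the text and tally what it sees.
        for my $span ($Shortest .. $Longest) {
            for my $pos (0 .. length($input)-$span) {
                $Substrings{ substr($input, $pos, $span) }++;
            }
        }

        # Most frequent first, dropping rare substrings and honoring the limit.
        my $count = 0;
        foreach (grep { not $Limit or $count++ < $Limit }
                 grep { $Substrings{$_} >= $Minimum }
                 sort { $Substrings{$b} <=> $Substrings{$a} }
                 keys %Substrings) {
            print $Substrings{$_}, '<', $_, '>', "\n";
        }

        return 0;    # explicit exit status for exit(main(...))
    }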
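The counting loop can be tried in isolation on a tiny input. What follows is a minimal sketch, not part of the original snippet: `count_substrings` is a hypothetical helper name, and the alphabetical tie-break in the sort is an assumption added so that the output order is repeatable (the snippet sorts by count only, so substrings with equal counts come out in hash order).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original snippet): count every substring
# of $text whose length is between $shortest and $longest, inclusive.
sub count_substrings {
    my ($text, $shortest, $longest) = @_;
    my %count;
    for my $span ($shortest .. $longest) {
        # The range is empty when $text is shorter than $span.
        for my $pos (0 .. length($text) - $span) {
            $count{ substr($text, $pos, $span) }++;
        }
    }
    return \%count;
}

my $count = count_substrings('banana', 2, 3);

# Sort by descending count; ties are broken alphabetically (an assumption
# made here, not something the original snippet does).
for my $s (sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count) {
    print "$count->{$s}<$s>\n" if $count->{$s} >= 2;
}
```

For `banana` with lengths 2 to 3, this prints `2<an>`, `2<ana>`, and `2<na>`, one per line; `ba`, `ban`, and `nan` each occur once and are filtered out by the minimum of 2.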
Re: count moby substrings
by mildside (Friar) on Jul 30, 2003 at 06:46 UTC
by halley (Prior) on Jul 30, 2003 at 11:46 UTC
by Aristotle (Chancellor) on Jul 31, 2003 at 20:08 UTC