A Huffman coding algorithm will find the most common substrings. That is not the ultimate goal of Huffman coding, but it is an intermediate result, so that part of the algorithm could be reused.
Check out Algorithm::Huffman on CPAN.
Check out a book on data compression.
There is also a good explanation of Huffman coding in general in the digest of Perl Journal articles, Computer Science & Perl Programming (the article is probably available by itself from The Perl Journal).
Update: My suggestion won't really work. Optimal Huffman coding will certainly find the most common substrings, but not necessarily the longest ones, nor ones of a particular size.
Sorry; I'm not sure whether some twist on a Huffman algorithm would work, but some of the other posts end up doing effectively what I envision a hacked Huffman algorithm would do.
:(
This could be done in one pass through the string. Conceptually, it would work like so:
use strict;
use warnings;

my $number = '...';           # the digits to work on (e.g. 2**9999 as a string)
my @nums = split('', $number);
my (@most_common, %once, %common);
my $mcc = 2;                  # if there are no common substrings, don't store any

while (@nums > 4) {
    my $key = join('', @nums[0 .. 4]);
    $common{$key}++;
    if ($common{$key} > $mcc) {
        $mcc = $common{$key};
        @most_common = ($key);     # new max: reset the entire array to just $key
    }
    elsif ($common{$key} == $mcc) {
        push(@most_common, $key);  # tie: tack $key onto the list
    }
    if ($common{$key} == 1) {
        $once{$key} = 1;
    }
    elsif ($once{$key}) {
        delete $once{$key};
    }
    shift @nums;                   # slide down one digit
}

print "Most ($mcc): ", join(', ', @most_common), "\n";
print "Once: ", join(', ', keys %once), "\n";
I have added a bit of code to track the most common values and save them in an array. I would expect the least common values to be the set of substrings used only once, unless you define "least common" as the lowest count greater than one; then much more bookkeeping code would be needed at the low end. On the high end, if the current count exceeds the previous maximum, we simply forget the old list and start a new one with the current substring as its only member. I also started $mcc at 2 to avoid a lot of needless bookkeeping for substrings seen only once.
I ran the code on sample data of 2^9999 and got
Most (2): 96655, 84403, 66114, 11748, 17484, 40380, 74169, 41696, 41844, 47194, 71162, 92065, 54736, 28703, 84689, 22165, 92292, 47369, 41891, 87379, 37954, 04224, 42244, 08257, 35778, 23461, 29741, 19795, 79549, 78117,
56688, 58090, 43252, 32528, 42018, 98726, 03714, 41492, 24440, 01363, 40657, 90170, 41347, 48935, 89357
Once: every other substring, which is a long list.
If you have enough memory, you can substr 5 chars at a time into a hash and ++ the key, then increment the index of the substr until you have hashed the entire number, and finally sort by the largest value of the hash. This will take a lot of memory, though, so you would need a beefy box (or tie the hash to a disk-based store to save memory). The second advantage is that you can see the rank of all 5-char sequences after processing the file one time.
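For the memory-saving part: Tie::Hash by itself is just a base class, so what actually reduces RAM use is tying the hash to an on-disk store. A minimal sketch using the core SDBM_File module (the filename and the sample digits here are my own placeholders, not the real data):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;        # for O_RDWR, O_CREAT
use SDBM_File;    # core module: on-disk hash storage

my $number = "1231212312";    # placeholder digits

my %count;
# The counts live in an on-disk SDBM file instead of RAM.
tie(%count, 'SDBM_File', 'sdbm_counts', O_RDWR | O_CREAT, 0666)
    or die "Cannot tie SDBM file: $!";

# Hash every 5-digit window, bumping the index one digit at a time.
$count{ substr($number, $_, 5) }++ for 0 .. length($number) - 5;

# Sort by the largest value of the hash to get the ranking.
for my $key (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$key => $count{$key}\n";
}

untie %count;
```

Each lookup and increment now goes through the tied store, so this trades speed for memory, as the posts below also note.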
-Waswas
Actually, your third algorithm is not likely to produce the results you want. Consider the following code snippet:
#!/usr/bin/perl
use strict;
use warnings;

my $string = "123123124";
my $len    = 5;
my %substrings;

for (my $i = 0; $i + $len <= length $string; $i++) {
    my $sub = substr($string, $i, $len);
    $substrings{$sub}++;
}

print "$_ => $substrings{$_}\n"
    for sort { $substrings{$b} <=> $substrings{$a} || $a cmp $b }
        keys %substrings;
If you were to mask out 12312 right after finding it, you would remove the possibility of finding 23123. That's not so good.
By the way, the above algorithm naively implements the method suggested by the first responder. I think a possible savings (trading a lot of time for RAM) would be to use a file for the hash storage. But I would recommend trying it out before attempting to optimize it.
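A small follow-up sketch (the helper sub name is mine) makes the overlap loss concrete: mask out 12312 and the lone 23123 vanishes with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every 5-character window in a string.
sub window_counts {
    my ($string) = @_;
    my %counts;
    $counts{ substr($string, $_, 5) }++ for 0 .. length($string) - 5;
    return %counts;
}

my $string = "123123124";
my %before = window_counts($string);
print "before masking: 23123 seen ", ($before{"23123"} // 0), " time(s)\n";

(my $masked = $string) =~ s/12312//g;    # mask out the most common window
my %after = window_counts($masked);
print "after masking:  23123 seen ", ($after{"23123"} // 0), " time(s)\n";
```

Masking leaves only "3124", which is too short to contain any 5-digit window at all, so 23123 is gone from the counts.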
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement. Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.
If everything is a digit, you can use an array instead of a hash to store the counts. In some quick tests, that uses about a third as much memory. But even using a hash, storing all possible 5-digit patterns only takes about 9MB, so however you do it, the second approach is very tempting.
Using numbers instead of strings also lets you switch to the next number with multiply, modulus, and addition operators, which may be faster than manipulating strings.
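That rolling-window arithmetic can be sketched like so (the variable names and sample digits are mine): multiply the current window by 10, add the incoming digit, and take the result mod 100000 to drop the outgoing digit. Since every window is exactly 5 digits, the integers 0 through 99999 index the windows with no collisions; we zero-pad on output to recover windows with leading zeros.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $number = "1231212312";    # sample digits
my @count;                    # index = the 5-digit window as an integer

# Seed the window with the first 5 digits.
my $window = substr($number, 0, 5);
$count[$window]++;

# Slide one digit at a time: shift left (times 10), append the next
# digit, and drop the outgoing leading digit (mod 100000).
for my $i (5 .. length($number) - 1) {
    $window = ($window * 10 + substr($number, $i, 1)) % 100000;
    $count[$window]++;
}

# Report the windows seen, zero-padded back to 5 digits.
for my $w (0 .. $#count) {
    printf "%05d => %d\n", $w, $count[$w] if $count[$w];
}
```

Each step is one multiply, one add, and one modulus, avoiding a substr copy per window once the seed is taken.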
Typical language analysis goes with your Option #2. I think I'll submit a code snippet which I used on some texts to extend some Moby Project results.
I'm not sure what you'd accomplish with Option #3. You say you want to find the most common substring, but then your #3 says you'd mask out all occurrences of the most common "word." Don't you need to solve the question before you can perform that work? And second, if you wanted the second-most-common substring, is masking out the topmost common substrings going to give you what you want?
off the record, heretofore, the officer found that in the
theater of war, one hath need of a weatherproof theory of games
Now, remove substring 'he' and you won't find 'the'. Remove 'of' and you won't find 'off'. Is this effect intentional in your search? (Sorry, I couldn't contrive an example with overlapping 5-character snippets.)
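The effect is easy to check on that very sentence: count 'the' before and after stripping every 'he'. A quick sketch (the helper sub is my own):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "off the record, heretofore, the officer found that in the "
         . "theater of war, one hath need of a weatherproof theory of games";

# Count non-overlapping occurrences of a substring.
sub occurrences {
    my ($haystack, $needle) = @_;
    my $n = () = $haystack =~ /\Q$needle\E/g;
    return $n;
}

printf "'the' before masking: %d\n", occurrences($text, "the");

(my $masked = $text) =~ s/he//g;    # mask out the shorter common substring
printf "'the' after masking:  %d\n", occurrences($masked, "the");
```

Every 'the' contains an 'he', so masking the shorter string wipes out all six hits of the longer one ("the" twice standalone plus inside "theater", "weatherproof", "theory", and "in the").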
-- [ e d @ h a l l e y . c c ]
Thanks all -- it appears my second approach is by consensus the most efficient. Option three never really took form in my head properly, and apparently won't work the way I was hoping. This has been extremely helpful!