Most common substring

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Most common substring by shemp (Deacon) on Jul 28, 2003 at 19:02 UTC
a Huffman coding algorithm should find the most common substrings. That is not the ultimate goal of Huffman coding, but it is an intermediate result, so that part of the algorithm could be used. Check out Algorithm::Huffman at CPAN. Check out a book on data compression Also, there is a good explanation of Huffman coding in general, in the digest of Perl Journal articles Computer Science & Perl Programming(the article is probably available by itself from the perl journal) Update:My suggestion won't really work. Optimal Huffman coding will certainly find the most common substrings, but not necessarily the longest, or necessarily of a certain size. Sorry, i'm not sure if some twist on a Huffman algorithm will work, but some of the other posts end up doing effectively what i envision a hacked Huffman algorithm would do. :(	[reply]
Re: Most common substring by dga (Hermit) on Jul 28, 2003 at 19:10 UTC
This could be done in one pass through the string. Conceptually it would work like so. use strict; my @nums=split('', $number); # the number to work on my(@most_common, %once, %common); my $mcc=2; # if there are no common substrings don't store while( @nums > 4) { my $key = join('', @nums[0..4]); $common{$key}++; if($common{$key} > $mcc) { $mcc=$common{$key}; @most_common=($key); #new max set entire array to $key } elsif($common{$key} == $mcc) { push(@most_common, $key); # tack $key onto largest } if($common{$key} == 1) { $once{$key}=1; } elsif($once{$key}) { delete $once{$key}; } shift @nums; #slide down 1 digit } print "Most ($mcc): ", join(', ', @most_common), "\n"; print "Once: ", join(', ', (keys %once)), "\n"; [download] I have added a bit of code to track the most common values and save them in an array. I would expect that least common will most likely be the set of substrings only used once. Unless you define common to be lowest but more than once. Then much more bookkeeping code would be needed for the low end. On the high end if the current case is larger than the previous largest we simply forget about the list and start a new list with the current substring as it's only member. I also started $mcc at 2 to avoid a lot of needless bookkeeping for substrings seen only once. I ran the code on sample data of 2^9999 and got Most (2): 96655, 84403, 66114, 11748, 17484, 40380, 74169, 41696, 41844, 47194, 71162, 92065, 54736, 28703, 84689, 22165, 92292, 47369, 41891, 87379, 37954, 04224, 42244, 08257, 35778, 23461, 29741, 19795, 79549, 78117, 56688, 58090, 43252, 32528, 42018, 98726, 03714, 41492, 24440, 01363, 40657, 90170, 41347, 48935, 89357 Once: every other substring which is a long list.	[reply] [d/l]
Re: Most common substring by waswas-fng (Curate) on Jul 28, 2003 at 19:00 UTC
if you have enough memory you can substr 5 chars at a time into a hash and ++ the key, then inc the index of the substr until you have hashed the entire number. then sort by the largest value of the hash. This will take a lot of memory though so you would need a beefy box (or use Tie::Hash to save memory). The second advantage to this is you can see the rank of all 5 char sequences afterproccessing the file 1 time. -Waswas	[reply]
Re: Most common substring by dragonchild (Archbishop) on Jul 28, 2003 at 19:18 UTC
Actually, your third algorithm is not likely to produce the results you want. Consider the following code snippet: `#!/usr/bin/perl my $string = "123123124"; my $len = 5; my %substrings; for (my $i = 0; $i + $len <= length $string; $i++) { my $sub = substr($string, $i, $len); $substrings{$sub}++; } print "$_ => $substrings{$_}\n" for sort { $substrings{$b} <=> $substrings{$a} \|\| $a cmp $b } keys %substrings;` [download] If you were to mask out 12312 right after finding it, you would remove the possibility of finding 23123. That's not so good. By the way, the above algorithm naively implements a method suggested by the first responder. I think that a possible savings (trading a lot of time for RAM) would be to use a file for the hash storage. But, I would recommend trying it out first before attempting to optimize it. ------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement. Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply] [d/l]
Re: Most common substring by Thelonius (Priest) on Jul 28, 2003 at 19:23 UTC
If it's too big to fit in memory, my choice would be to use an external sort (e.g. the unix sort command). Write all the substring to a file, then "sort \| uniq -c \| sort -rn". That's the whole program.	[reply]
Re: Most common substring by sgifford (Prior) on Jul 28, 2003 at 20:40 UTC
If everything is a digit, you can use an array instead of a hash to store the counts. In some quick tests, that uses about a third as much memory. But even using a hash, storing all possible 5-digit patterns only takes about 9MB, so no matter how you do it the 2nd approach is very tempting. Using numbers instead of strings also lets you switch to the next number with multiply, modulus, and addition operators, which may be faster than manipulating strings.	[reply]
Re: Most common substring by halley (Prior) on Jul 28, 2003 at 20:49 UTC
Typical language analysis goes with your Option #2. I think I'll submit a code snippet which I used on some texts to extend some Moby Project results. I'm not sure what you'd accomplish with the Option #3. You say you want to find the most common substring, but then your #3 says you'd mask out all occurrences of the most common "word." Don't you need to solve the question before you can perform that work? And secondly, if you wanted the second-most-common substring, is masking out the topmost common substrings going to be what you want? `off the record, heretofore, the officer found that in the theater of war, one hath need of a weatherproof theory of games` [download] Now, remove substring 'he' and you won't find 'the'. Remove 'of' and you won't find 'off'. Is this effect intentional in your search? (Sorry, I couldn't contrive an example with overlapping 5-character snippets.) -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l]
Re: Re: Most common substring by Anonymous Monk on Jul 29, 2003 at 19:20 UTC
Thanks all -- it appears my second approach is by consensus the most efficient. Option three never really took form in my head properly, and apparently won't work the way I was hoping. This has been extremely helpful!	[reply]