Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Basic Question: Is there any simple way in Perl to find the most common substring?

Complex Question: I'd really like to take a long number (say 100,000 digits long) and find the least and the most common strings of a given length (say 5). The three approaches that come to mind are:

  1. Set up a database with all possible numbers of length 5 and run regex's against the string, updating the database as I go (requires 10^5 passes of the string)
  2. Move across the string once and read all possible 5-digit strings, storing them in a DB as I go. Comparing the contents of the DB-table with a list of all possible 5-mers should identify any "missing" 5-digit strings.
  3. Search the string for the most common word, mask it somehow, and then repeat until the entire string is "masked"

Somehow the last approach seems most "elegant" to me, but I'm fuzzy on it. But general comments on how to go about this are totally helpful. I had troubles figuring out what terms to google to find out about algorithms that might do this sort of "parsing of all substrings".

Replies are listed 'Best First'.
Re: Most common substring
by shemp (Deacon) on Jul 28, 2003 at 19:02 UTC
    a Huffman coding algorithm should find the most common substrings. That is not the ultimate goal of Huffman coding, but it is an intermediate result, so that part of the algorithm could be used.

    Check out Algorithm::Huffman at CPAN.
    Check out a book on data compression
    Also, there is a good explanation of Huffman coding in general, in the digest of Perl Journal articles Computer Science & Perl Programming(the article is probably available by itself from the perl journal)

    Update:My suggestion won't really work. Optimal Huffman coding will certainly find the most common substrings, but not necessarily the longest, or necessarily of a certain size.
    Sorry, i'm not sure if some twist on a Huffman algorithm will work, but some of the other posts end up doing effectively what i envision a hacked Huffman algorithm would do.
    :(
Re: Most common substring
by dga (Hermit) on Jul 28, 2003 at 19:10 UTC

    This could be done in one pass through the string.

    Conceptually it would work like so.

    use strict; my @nums=split('', $number); # the number to work on my(@most_common, %once, %common); my $mcc=2; # if there are no common substrings don't store while( @nums > 4) { my $key = join('', @nums[0..4]); $common{$key}++; if($common{$key} > $mcc) { $mcc=$common{$key}; @most_common=($key); #new max set entire array to $key } elsif($common{$key} == $mcc) { push(@most_common, $key); # tack $key onto largest } if($common{$key} == 1) { $once{$key}=1; } elsif($once{$key}) { delete $once{$key}; } shift @nums; #slide down 1 digit } print "Most ($mcc): ", join(', ', @most_common), "\n"; print "Once: ", join(', ', (keys %once)), "\n";

    I have added a bit of code to track the most common values and save them in an array. I would expect that least common will most likely be the set of substrings only used once. Unless you define common to be lowest but more than once. Then much more bookkeeping code would be needed for the low end. On the high end if the current case is larger than the previous largest we simply forget about the list and start a new list with the current substring as it's only member. I also started $mcc at 2 to avoid a lot of needless bookkeeping for substrings seen only once.

    I ran the code on sample data of 2^9999 and got

    Most (2): 96655, 84403, 66114, 11748, 17484, 40380, 74169, 41696, 41844, 47194, 71162, 92065, 54736, 28703, 84689, 22165, 92292, 47369, 41891, 87379, 37954, 04224, 42244, 08257, 35778, 23461, 29741, 19795, 79549, 78117, 56688, 58090, 43252, 32528, 42018, 98726, 03714, 41492, 24440, 01363, 40657, 90170, 41347, 48935, 89357

    Once: every other substring which is a long list.

Re: Most common substring
by waswas-fng (Curate) on Jul 28, 2003 at 19:00 UTC
    if you have enough memory you can substr 5 chars at a time into a hash and ++ the key, then inc the index of the substr until you have hashed the entire number. then sort by the largest value of the hash. This will take a lot of memory though so you would need a beefy box (or use Tie::Hash to save memory). The second advantage to this is you can see the rank of all 5 char sequences afterproccessing the file 1 time.

    -Waswas
Re: Most common substring
by dragonchild (Archbishop) on Jul 28, 2003 at 19:18 UTC
    Actually, your third algorithm is not likely to produce the results you want. Consider the following code snippet:
    #!/usr/bin/perl my $string = "123123124"; my $len = 5; my %substrings; for (my $i = 0; $i + $len <= length $string; $i++) { my $sub = substr($string, $i, $len); $substrings{$sub}++; } print "$_ => $substrings{$_}\n" for sort { $substrings{$b} <=> $substrings{$a} || $a cmp $b } keys %substrings;
    If you were to mask out 12312 right after finding it, you would remove the possibility of finding 23123. That's not so good.

    By the way, the above algorithm naively implements a method suggested by the first responder. I think that a possible savings (trading a lot of time for RAM) would be to use a file for the hash storage. But, I would recommend trying it out first before attempting to optimize it.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Most common substring
by Thelonius (Priest) on Jul 28, 2003 at 19:23 UTC
    If it's too big to fit in memory, my choice would be to use an external sort (e.g. the unix sort command). Write all the substring to a file, then "sort | uniq -c | sort -rn".

    That's the whole program.

Re: Most common substring
by sgifford (Prior) on Jul 28, 2003 at 20:40 UTC

    If everything is a digit, you can use an array instead of a hash to store the counts. In some quick tests, that uses about a third as much memory. But even using a hash, storing all possible 5-digit patterns only takes about 9MB, so no matter how you do it the 2nd approach is very tempting.

    Using numbers instead of strings also lets you switch to the next number with multiply, modulus, and addition operators, which may be faster than manipulating strings.

Re: Most common substring
by halley (Prior) on Jul 28, 2003 at 20:49 UTC

    Typical language analysis goes with your Option #2. I think I'll submit a code snippet which I used on some texts to extend some Moby Project results.

    I'm not sure what you'd accomplish with the Option #3. You say you want to find the most common substring, but then your #3 says you'd mask out all occurrences of the most common "word." Don't you need to solve the question before you can perform that work? And secondly, if you wanted the second-most-common substring, is masking out the topmost common substrings going to be what you want?

    off the record, heretofore, the officer found that in the theater of war, one hath need of a weatherproof theory of games

    Now, remove substring 'he' and you won't find 'the'. Remove 'of' and you won't find 'off'. Is this effect intentional in your search? (Sorry, I couldn't contrive an example with overlapping 5-character snippets.)

    --
    [ e d @ h a l l e y . c c ]

      Thanks all -- it appears my second approach is by consensus the most efficient. Option three never really took form in my head properly, and apparently won't work the way I was hoping. This has been extremely helpful!