when you first mentioned this in the chatterbox you asked if anyone had any suggestions to improve it. So here goes FWIW.

You are writing your matches to a temporary file and then reading them in again to construct a hash. That means that if you find any substrings with more than one occurrence, you will end up with entries in your data file for all of them. For instance if you find that token you mentioned and it's in every line of your - lets say - 500 record file. Then your first mention in the text file says 500 occurrences, the next says 499 and so on down to 2 occurrences. You don't get an entry that says one, but you will have checked for it.

If instead of writting to that file you created the hash as you process the file, then right at the top you can just check if it exists. If it does then dont bother checking any further, you've already found all these matches. For the example I gave this equates to leaving out 124750 checks and that's just for one pattern.

like this:

do { my $p = $packets[0]; foreach my $l ($level..length($p)) { foreach my $pos (0..length($p)-$l) { my $str = substr($p,$pos,$l); next if exists $all{$str}; # if we've already found this # string somewhere else, exit # this iteration my $num = 0; for (0..$#packets) { if ($l <= length($packets[$_])) { pos($packets[$_]) = 0; while ($packets[$_] =~ /$str/g) { $num ++; } } } unless (exists $all{$str}) { $all{$str} = $num unless $num < $threshold; } } } shift(@packets); } while ($#packets >= 0);

Nuance


In reply to RE: Substring Finding/Counting by nuance
in thread Substring Finding/Counting by Guildenstern

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.