Guildenstern has asked for the wisdom of the Perl Monks concerning the following question:

Okay. As some of you may remember, I asked a question a couple of days ago about comparing packet data to find common substrings. (See My Original Post)
I have written some code (see below) that appears to produce valid output. However, when I was looking through the output, I've noticed that some of the substring match counts are wrong. For a sample run, I copied the same packet data four times, which should have made the output a little easier to check.
The first line of output is a substring that contains all of the packet data, so it makes sense that there are four matches. The second line bothers me, though. It's a substring length($packet)-1 in size. Eye-grepping the input, I can see that this should have four matches as well, but when I run the script I get 7!
I'm not asking for an in-depth analysis, but I've been looking at it so much in the last couple of days that I may be missing something obvious. Any help is appreciated!
die "No file name" unless @ARGV; my @packets = (); my $short = 9999; while (<>) { push(@packets,$_); if (length($_)-1 < $short) { $short = length($_) -1; # Find shortest packet } } chop(@packets); my $maxlen =0; my $index = 0; my $curpack; my %all; foreach $curpack (@packets) { #Iterate over packets $maxlen = length($curpack); for (my $fsize = 2; $fsize <= $maxlen; $fsize ++) { #Check size 2..$maxlen substrings for (my $pnum = 0; $pnum < @packets; $pnum ++) { #Iterate over packets to check against for (0..($maxlen-$fsize+1)) { #Don't compare against self or shorter strings if (($index != $pnum) && ($fsize <= length($packets[$pnum]))) { my $str = substr($curpack,$_,$fsize); my @temparr = ($packets[$pnum] =~ /$str/g); my $nmatch = @temparr; if (defined($all{$str})) { $all{$str} += $nmatch; } else { #Remember to count self $all{$str} = $nmatch + 1; } undef @temparr; } } } } $index ++; print "Comparing packet ",$index," to all others:\n"; my $value; my @sort = sort {length($b) <=> length($a)} keys %all; foreach $value (@sort) { #Show only results in >1 packet if ($all{$value} > 1) { print "(",length($value),") $value: $all{$value} times\n"; } delete($all{$value}); } undef @sort; } exit;

Replies are listed 'Best First'.
Re: Packet Patterns redux - my script can't count
by ZZamboni (Curate) on Aug 08, 2000 at 20:56 UTC
    The first thing that comes to mind as a possible source of your problem is the way you are comparing, using a regular expression:
    my @temparr = ($packets[$pnum] =~ /$str/g);
    This is potentially inaccurate because if the ASCII representation of $str includes any regular expression metacharacters (like dots) it may count things that should not match. It is also potentially dangerous because some of those characters may cause your program to crash because they form an invalid regex. You could get around that by using \Q and \E, like this:
    my @temparr = ($packets[$pnum] =~ /\Q$str\E/g);
    The second thing is that the logic of your loops seems a little too convoluted. It could probably be rewritten as something like this (untested):
    foreach my $p (@packets) { foreach my $l (2..length($p)) { foreach my $pos (0..length($p)-$l) { my $str=substr($p, $pos, $l); $all{$str}+=$_ for map { scalar(($_ =~ /\Q$str\E/g)) } @packets; } } }
    The map builds a list with the count of how many times $str appears in each element from @packets, and the for in that same line adds all those counts to the corresponding element of %all. I think it's essentially the same algorithm you had before, except that it does count the current packet in the matches.

    --ZZamboni

      Wow! Thanks!
      Added your code and edited mine to fit it, and I get numbers that are closer to what I'm expecting. I'd love to say they're perfect, but now it appears that each value is in my list with its count value doubled. (Of course, I have the occasional 3 or 5, but I think I know where those are coming from.)
      What's happening is that some packets are nearly identical to each other, so when packet X is tested, the map finds N matches with packet Y. But, when packet Y is tested, the same N matches are found with packet X. I'll bang my head on the desk for a minute or two and see if the answer rattles loose.
      Again, thanks for the help. Not only did it correct the behavior, but it's ever so much more elegant. Too bad I've already used my votes today.