Re: Algorithm To Select Lines Based On Attributes

Update: nevermind about my solution. I missed your caveat about what types of rules you have. The advice given at the bottom still stands. Examples are better than prose for many things.

Instead of looking through each line for each defect, look for each defect in every line using an alternating regular expression. This lets you only look through each line once, and gives you the advantage of having the highly optimized regex engine do much of the work.

I'm not even sure where %rulelist or $rulenum are supposed to be set in the above.

Do negated defects just not add up, or do they actually remove a defect from the final count? Here I'll assume they just don't get added in.

If I'm not misunderstanding your spec, this does everything you need short of reading which defects interest you from another file:

use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );
my $alternation = join '|(?<!!)', @defects_to_check;
# previous and next lines use negative look-behind to ensure
# only defects listed without '!' preceding them get matched
my $regex = qr/(?<!!)$alternation/;

open ( my $df, '<', 'defects_file' ) or die "can't read defects_file: 
+$!\n";
my $total_defects = 0;
while ( <$df> ) {
  next unless /^DEFECTID/;
  my @defects_found = $_ =~ m/$regex/g;
  $total_defects += scalar @defects_found;
  print "defects found this line: ", (join ', ', @defects_found), "\n"
+;
  print "total defects so far: $total_defects\n";
}
close $df;
[download]

Given this input file for defects:

DEFECTID ATTR1 ATTR7 ATTR4
DEFECTID ATTR3 !ATTR1
DEFECTID ATTR2 ATTR5 ATTR3
DEFECTID ATTR4

DEFECTID ATTR3
[download]

it produces this output:

defects found this line: ATTR1, ATTR7
total defects so far: 2
defects found this line: ATTR3
total defects so far: 3
defects found this line: ATTR3
total defects so far: 4
defects found this line:
total defects so far: 4
defects found this line: ATTR3
total defects so far: 5
[download]

Now, with a million lines, I'd probably not print the new defects found and the new total for every line. If you need to know which defects had what subtotals, you could accomplish that with a hash:

use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );
my $alternation = join '|(?<!!)', @defects_to_check;
# previous and next lines use negative look-behind to ensure
# only defects listed without '!' preceding them get matched
my $regex = qr/(?<!!)$alternation/;

open ( my $df, '<', 'defects_file' ) or die "can't read defects_file: 
+$!\n";
my $total_defects = 0;
my %defect_subtotals;
while ( <$df> ) {
  next unless /^DEFECTID/;
  my @defects_found = $_ =~ m/$regex/g;
  $total_defects += scalar @defects_found;
  $defect_subtotals{ $_ }++ for @defects_found;
}
close $df;

print "Found $total_defects total defects.\nDefect breakdown follows:\
+n";
print $_ . ":\t\t" . $defect_subtotals{$_} . "\n" for sort keys %defec
+t_subtotals;
[download]

Given the same input file as above, it produces this output:

Found 5 total defects.
Defect breakdown follows:
ATTR1:          1
ATTR3:          3
ATTR7:          1
[download]

A sample of input and a sample of output like this is very helpful in determining whether we're talking about the same spec. If I've made any incorrect assumptions about your spec, please give your own sample input and output so a monk can write a program to match.

Comment on Re: Algorithm To Select Lines Based On Attributes Select or Download Code

Replies are listed 'Best First'.
Re^2: Algorithm To Select Lines Based On Attributes by ~~David~~ (Hermit) on Jan 15, 2009 at 17:58 UTC
Input Example: DefectRecordSpec 17 DEFECTID XREL YREL XINDEX YINDEX XSIZE YSIZE DEFEC +TAREA DSIZE CLASSNUMBER TEST CLUSTERNUMBER ROUGHBINNUMBER FINEBINNUMB +ER REVIEWSAMPLE IMAGECOUNT IMAGELIST ; DefectList 1 466.458 4741.229 -2 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 2 5329.992 4264.499 -2 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 3 1469.965 4591.523 -1 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 4 7082.505 4283.913 -1 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 5 777.809 2623.219 -2 24 1.000 0.862 0.862860 1.000 0 1 0 0 0 0 0 0 6 376.807 3904.135 -1 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 7 5345.841 3877.818 0 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 [download] I could make a rule XINDEX,-2 ( give me X-index of only -2. Negating this give me all values without XINDEX = -2 ). Output: `DefectRecordSpec 17 DEFECTID XREL YREL XINDEX YINDEX XSIZE YSIZE DEFEC +TAREA DSIZE CLASSNUMBER TEST CLUSTERNUMBER ROUGHBINNUMBER FINEBINNUMB +ER REVIEWSAMPLE IMAGECOUNT IMAGELIST ; DefectList 1 466.458 4741.229 -2 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 2 5329.992 4264.499 -2 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 5 777.809 2623.219 -2 24 1.000 0.862 0.862860 1.000 0 1 0 0 0 0 0 0` [download]	[reply] [d/l] [select]
Re^3: Algorithm To Select Lines Based On Attributes by gone2015 (Deacon) on Jan 15, 2009 at 21:36 UTC
You said the rules should be additive. If you're applying two rules, say XINDEX == -2, and XSIZE == 2, do you want to select defect 1 once or twice ? If twice, then I you're stuck with N * R work, where N is the number of records and R is the number of rules. If once, then one trick would be to run easy and/or most likely to match rules first. If the rules weren't additive, I'd suggest looking for ways to quickly exclude defects which definitely would not match any rule -- the idea being that you may be able to quickly hack away at the search space. So, as others have said, the problem is how to take your rules and render them into something that Perl can execute as quickly as possible. Here I think the basic trick is something along these lines: use strict ; use warnings ; my @DefectRecordSpec = qw(DEFECTID XREL YREL XINDEX YINDEX XSIZE YSIZE + DEFECTAREA DSIZE CLASSNUMBER TEST CLUSTERNUMBER ROUGHBINNUMBE +R FINEBINNUMBER REVIEWSAMPLE IMAGECOUNT IMAGELIST) ; my @DefectList = ( '1 466.458 4741.229 -2 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 +', '2 5329.992 4264.499 -2 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 +', '3 1469.965 4591.523 -1 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 +', '4 7082.505 4283.913 -1 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 +', '5 777.809 2623.219 -2 24 1.000 0.862 0.862860 1.000 0 1 0 0 0 0 0 0 +', '6 376.807 3904.135 -1 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 +', '7 5345.841 3877.818 0 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 +', ) ; my @rules = ( '$XINDEX == -2', '($XSIZE > 1.75) && ($XSIZE <= 2.15)', ) ; my $fields = '$'. join(', $', @DefectRecordSpec) ; my $rules = join("\n \|\| ", map "($_)", @rules) ; my $sub = join("\n", 'sub {', ' my ($line) = @_ ;', ' $line =~ s/^\s+// ;', # Trim leading spaces, so split is not foo +led # NB: the split will discard trailing spac +es and newline ' my ('. $fields .') = split(/\s+/, $line) ;', ' return '. $rules .' ;', '} ;' ) ; my $test = eval $sub ; print $sub, "\n" ; foreach my $d (@DefectList) { print "$d\n" if $test->($d) ; } ; [download] where your rules are each small Perl expressions, so can be arbitrarily complicated.	[reply] [d/l]


Perl: the Markov chain saw
	PerlMonks