Update: never mind about my solution. I missed your caveat about what types of rules you have. The advice given at the bottom still stands: examples are better than prose for many things.

Instead of scanning each line once per defect, scan each line once for all of the defects at the same time, using a regular expression with alternation. That way you pass over each line only once and let the highly optimized regex engine do much of the work.

I'm not even sure where %rulelist or $rulenum are supposed to be set in the above.

Do negated defects just not add up, or do they actually remove a defect from the final count? Here I'll assume they just don't get added in.
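To make that assumption concrete, here is a minimal sketch of the negative look-behind on a single line. The sample line is made up for illustration, and the attribute names are the same placeholders I use in the rest of this post:

```perl
use strict;
use warnings;

# Hypothetical sample line; ATTR1 is negated, ATTR3 and ATTR7 are not.
my $line = 'DEFECTID ATTR3 !ATTR1 ATTR7';

# (?<!!) fails wherever the preceding character is '!',
# so '!ATTR1' is skipped: neither counted nor subtracted.
my @found = $line =~ /(?<!!)(?:ATTR1|ATTR3|ATTR7)/g;

print "found: @found\n";
```

If negated defects are supposed to subtract from the count instead, this is the part that would have to change.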

If I'm not misunderstanding your spec, this does everything you need short of reading which defects interest you from another file:

use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );
my $alternation = join '|(?<!!)', @defects_to_check;
# previous and next lines use negative look-behind to ensure
# only defects listed without '!' preceding them get matched
my $regex = qr/(?<!!)$alternation/;

open ( my $df, '<', 'defects_file' ) or die "can't read defects_file: $!\n";
my $total_defects = 0;
while ( <$df> ) {
  next unless /^DEFECTID/;
  my @defects_found = $_ =~ m/$regex/g;
  $total_defects += scalar @defects_found;
  print "defects found this line: ", (join ', ', @defects_found), "\n";
  print "total defects so far: $total_defects\n";
}
close $df;

Given this input file for defects:

DEFECTID ATTR1 ATTR7 ATTR4
DEFECTID ATTR3 !ATTR1
DEFECTID ATTR2 ATTR5 ATTR3
DEFECTID ATTR4

DEFECTID ATTR3

it produces this output:

defects found this line: ATTR1, ATTR7
total defects so far: 2
defects found this line: ATTR3
total defects so far: 3
defects found this line: ATTR3
total defects so far: 4
defects found this line:
total defects so far: 4
defects found this line: ATTR3
total defects so far: 5
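One caveat: as written, the regex would also match ATTR1 inside a longer name such as ATTR10. I don't know whether your attribute names can overlap like that, but if they can, word boundaries plus quotemeta are a cheap guard. A sketch, again with a made-up sample line:

```perl
use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );

# quotemeta guards against regex metacharacters in the names;
# \b keeps ATTR1 from matching inside ATTR10.
my $alternation = join '|', map { quotemeta($_) } @defects_to_check;
my $regex = qr/(?<!!)\b(?:$alternation)\b/;

# Hypothetical line: ATTR10 is not in the list, ATTR1 is negated.
my $line  = 'DEFECTID ATTR10 ATTR3 !ATTR1';
my @found = $line =~ /$regex/g;

print "found: @found\n";
```

Grouping the alternation with (?:...) also means the look-behind only has to be written once.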

Now, with a million lines, I'd probably not print the new defects found and the new total for every line. If you need to know which defects had what subtotals, you could accomplish that with a hash:

use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );
my $alternation = join '|(?<!!)', @defects_to_check;
# previous and next lines use negative look-behind to ensure
# only defects listed without '!' preceding them get matched
my $regex = qr/(?<!!)$alternation/;

open ( my $df, '<', 'defects_file' ) or die "can't read defects_file: $!\n";
my $total_defects = 0;
my %defect_subtotals;
while ( <$df> ) {
  next unless /^DEFECTID/;
  my @defects_found = $_ =~ m/$regex/g;
  $total_defects += scalar @defects_found;
  $defect_subtotals{ $_ }++ for @defects_found;
}
close $df;

print "Found $total_defects total defects.\nDefect breakdown follows:\n";
print $_ . ":\t\t" . $defect_subtotals{$_} . "\n" for sort keys %defect_subtotals;

Given the same input file as above, it produces this output:

Found 5 total defects.
Defect breakdown follows:
ATTR1:          1
ATTR3:          3
ATTR7:          1

A sample of input and a sample of output like this is very helpful in determining whether we're talking about the same spec. If I've made any incorrect assumptions about your spec, please give your own sample input and output so a monk can write a program to match.
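The one piece the programs above leave out is reading the defects of interest from another file. I don't know what that file looks like, so the following is only a sketch under an assumed format: one attribute name per line, with blank lines and lines starting with '#' ignored. An in-memory string stands in for the real file so the example runs on its own.

```perl
use strict;
use warnings;

# Stand-in for a real list file; the format is an assumption.
my $list_text = "ATTR1\n# a comment\n\nATTR3\nATTR7\n";

# Opening a reference to a scalar reads from the string as if it were a file.
open( my $lf, '<', \$list_text ) or die "can't read defect list: $!\n";

my @defects_to_check;
while ( my $name = <$lf> ) {
  chomp $name;
  $name =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
  next if $name eq '' or $name =~ /^#/;    # skip blank lines and comments
  push @defects_to_check, $name;
}
close $lf;

print "checking for: @defects_to_check\n";
```

With a real file you would pass its name to open in place of \$list_text, and @defects_to_check would then replace the hard-coded qw() list.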

In reply to Re: Algorithm To Select Lines Based On Attributes by mr_mischief
in thread Algorithm To Select Lines Based On Attributes by ~~David~~
