Re: Recommendations for efficient data reduction/substitution application

(Roll-up of answers to the responses thus far, with my appreciation!)

Re: hdb (Re: Recommendations for efficient data reduction/substitution application):
Yes, the application of the substitutions is the process to remove the noise. The records may include errors that have not been seen in production before, so no, there is not a method I am aware of to extract the data from the records instead of removing the noise.

Re: kennethk (Re: Recommendations for efficient data reduction/substitution application):
At that point, I have broken the record up into parts in a hash called %entry (which includes other things such as the host logging the message, the time stamp, etc.). While I would love to be able to pull the entire data set into memory and run through the 100+ regexes one time, the combined size of the logs to process (several have exceeded 5GB in size so far) discourages the attempt. (Unless there is another way that has not come to mind yet.)
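
For reference, a minimal sketch of the record-at-a-time approach, which keeps memory use flat no matter how large the log is. The `@noise` table, the `reduce_stream` helper, and the sample data are hypothetical stand-ins, not the production rules:

```perl
use strict;
use warnings;

# Hypothetical noise patterns, compiled once before any input is read.
my @noise = (
    [ qr/\bpid=\d+/        => 'pid=N'  ],
    [ qr/\b0x[0-9a-f]+\b/i => '0xADDR' ],
);

# Read one record at a time so memory use does not grow with file size.
sub reduce_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        for my $rule (@noise) {
            my ($re, $to) = @$rule;
            $line =~ s/$re/$to/g;
        }
        print {$out} $line;
    }
}

# Demo on in-memory filehandles standing in for a multi-gigabyte log.
my $log = "start pid=17 at 0xDEADBEEF\nstop pid=18\n";
open my $in,  '<', \$log       or die "open: $!";
open my $out, '>', \my $result or die "open: $!";
reduce_stream($in, $out);
close $out;
print $result;
```

Every rule still runs against every record here; the sketch only shows that the file size itself need not be the limiting factor.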

Re: shmem (Re: Recommendations for efficient data reduction/substitution application):
In this case, it becomes "I don't care about this, this, or that, but I need the rest of it." I had not thought about study(), however. I will look into that.

Thank you all for your input and assistance.


Replies are listed 'Best First'.
Re^2: Recommendations for efficient data reduction/substitution application
by kennethk (Abbot) on Mar 03, 2015 at 21:20 UTC

If your expressions contain no backreferences, the substitutions are independent of one another (as opposed to sequential), and matches are sparse, you could precompile the world's ugliest regular expression:

    my $main_re = do {
        local $" = ')|)(?=.*?(';
        my @REs = map $_->{from}, @conversions;
        qr/^(?=.*?(@REs)|)/;
    };

Called in list context, the match returns one element per expression, defined only at the indices where that expression matched. Thus allowing:

    my @scan = $entry{$k} =~ $main_re;
    for my $i (0..$#scan) {
        next unless defined $scan[$i];
        $entry{$k} =~ s/$conversions[$i]{from}/$conversions[$i]{to}/g;
        $conversions[$i]{count}++;
    }

With backreferences, it's still possible but you'll have to map offsets. With dependencies between results, you could do something similar, but would have to map out Markov chains.

Probably more suggestions to be had with more details on the system.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
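
A self-contained sketch of the combined-lookahead idea, for anyone who wants to try it in isolation. The `@conversions` entries and the `reduce_field` helper are made up for illustration, and the sketch builds the pattern with an explicit `join` rather than by localizing `$"`. It assumes no `from` pattern contains a capture group of its own, so that capture N of the combined regex lines up with `$conversions[N]`:

```perl
use strict;
use warnings;

# Hypothetical conversion table. qr// objects stringify as non-capturing
# groups, so they add no capture groups to the combined pattern.
my @conversions = (
    { from => qr/\d+\.\d+\.\d+\.\d+/, to => '<ip>',      count => 0 },
    { from => qr/pid=\d+/,            to => 'pid=N',     count => 0 },
    { from => qr/session [0-9a-f]+/,  to => 'session X', count => 0 },
);

# Build ^(?=.*?(RE0)|)(?=.*?(RE1)|)... : each lookahead either captures
# its pattern somewhere in the string or falls back to the empty branch.
my $main_re = do {
    my $body = join ')|)(?=.*?(', map { $_->{from} } @conversions;
    qr/^(?=.*?($body)|)/s;
};

sub reduce_field {
    my ($text) = @_;
    my @scan = $text =~ $main_re;    # list context: one slot per pattern
    for my $i (0 .. $#scan) {
        next unless defined $scan[$i];    # this pattern did not match
        my $c = $conversions[$i];
        $text =~ s/$c->{from}/$c->{to}/g;
        $c->{count}++;
    }
    return $text;
}

print reduce_field('login from 10.0.0.7 pid=4242'), "\n";
# prints: login from <ip> pid=N
```

The single scan only tells you which substitutions are worth running; each matching pattern is still applied with its own `s///g`, so the saving depends on most patterns not matching most records.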
