Recommendations for efficient data reduction/substitution application

by atcroft (Abbot)
on Mar 03, 2015 at 18:54 UTC ( [id://1118652]=perlquestion )

atcroft has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone recommend the most efficient method of applying a large number of regular expression substitutions (> 100) to a relatively large number of input data lines (> 10_000_000)?

In a recent project at $work, I wrote a script to generate a report of logged problem occurrences. In this process, I have to simplify the data by removing variations that are "noise" relative to the issues I am looking at (such as a process ID, an amount of free memory when a memory use threshold is crossed, or a part of a message with more detail than needed in this report). Once I've broken the record into a few basic parts, I currently have a list of 100+ regexen, replacement strings, and order of application to use (currently stored in a DB table), and a simple for loop to apply them to each record:

foreach my $regex (@conversions) {
    if ( $entry{$k} =~ s/$regex->{from}/$regex->{to}/g ) {
        $regex->{count}++;   # Count of applications used to determine if
                             # a particular substitution is warranted.
    }
}
I have the nagging feeling, however, that there is a more efficient way of dealing with this. (Also, this processing takes up the largest percentage of the processing time in the script, which often takes hours to run.) Any thoughts/suggestions?

Thank you for your time and attention, and any direction you may provide.


Replies are listed 'Best First'.
Re: Recommendations for efficient data reduction/substitution application
by BrowserUk (Patriarch) on Mar 03, 2015 at 23:20 UTC

    My suggestion comes in two parts:

    1. Throw CPUs at it.

      It doesn't much matter whether they are in the form of threads or processes, so long as the coordination is done right.

      Take the size in bytes of your file, divide it by the number of cores you have at your disposal, and assign a byte range to each core.

      Of course, those byte ranges won't match exactly to lines, but if a process/thread seeks to its start offset and then uses readline, the first line read will likely be a partial one, so discard it and read the next.

      Then have your read loop check its file position with tell after each read, and stop once that position has moved past the end byte of its range. That way it processes its final line in full, including the partial line that the process handling the next range discarded.

      In (untested) code:

      my( $fh, $start, $end, $n ) = @_;   # $n is this process' sequence number
      seek $fh, $start, 0;
      <$fh>;                              # discard first (likely partial) line
      my $pos = tell $fh;
      while( $pos < $end ) {
          my $line = <$fh>;
          $pos = tell $fh;
          for my $regex ( @regex ) {
              $line =~ $regex;
          }
          ## write out modified line. (see below)
      }
    2. Process multiple lines at a time if your regexes allow it (or can be rewritten to do so).

      Starting the regex engine 100 times for each line is expensive. If you can safely rewrite them to process a batch of lines at a time, you can amortise the startup costs.

      (Note: that can be a 'big if'; if the re-write makes the regex much more complex, it can be self-defeating, but it's worth trying a few tests to find out.)

      Combining this with step 1 above takes thought and care to prevent exhausting your physical memory and moving into swapping.

      Let's say you have a 4-core machine with 4GB physical memory, but a 32-bit perl, thus a max of 2GB per process; and a 5GB file to deal with.

      First divide the file into 4; 0 - 1.25GB-1, 1.25GB - 2.5GB-1, 2.5GB - 3.75GB-1, 3.75GB - 5GB.

      If all 4 processes loaded their full quota, they'd move into swapping, so have them do it in two chunks. 4 * 0.625 = 2.5GB < 4GB.

      So, more untested code:

      my( $fh, $start, $end, $n ) = @_;   # $n is this process' sequence number
      my $mid = int( ( $start + $end ) / 2 );
      my @ranges = ( [ $start, $mid - 1 ], [ $mid, $end ] );   # a loop and careful math for more than 2 chunks
      for my $range ( @ranges ) {
          my( $start, $end ) = @$range;   # use different names if you prefer
          seek $fh, $start, 0;
          <$fh>, $start = tell( $fh ) if $start != 0;   # discard partial and get real starting point (if not beginning)
          seek $fh, $end, 0;
          <$fh>, $end = tell $fh;                       # get true end point
          seek $fh, $start, 0;
          read( $fh, my $buffer, $end - $start + 1 );
          for my $regex ( @regex ) {
              $buffer =~ $regex;
          }
          ## write out modified chunk. (see below)
      }
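
      For instance (the pattern below is purely illustrative, not one of the real regexes): a substitution anchored per line can often be applied to a whole buffer by adding the /m and /g modifiers, so ^ matches at every line start and the engine is started once per chunk rather than once per line.

      use strict;
      use warnings;

      # Illustrative only: collapse "name[1234]:" PIDs to "name[PID]:".
      my $buffer = "sshd[1234]: accepted connection\ncron[987]: job started\n";

      # Per line this would be s/^(\w+\[)\d+(\]:)/${1}PID$2/ inside a read loop;
      # with /m and /g it runs once over the whole buffer instead.
      my $n = $buffer =~ s/^(\w+\[)\d+(\]:)/${1}PID$2/mg;
      print "replacements: $n\n", $buffer;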

    The tricky bit with both schemes is combining the modified chunks back together, as the modifications will have changed their lengths.

    The simplest mechanism is to write separate small files with a naming convention that allows them to be ordered.

    E.g. you have 4 processes, so give each process a number and have them write to numbered files: infile.dat.part0n. Then have the parent process (that allocated and started the kids) wait for them to complete, and then merge the files back together.
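
    A minimal, untested sketch of such a parent harness; worker() is just a placeholder for the per-range routine sketched above (here given an extra output handle), and the file and part names are illustrative:

    use strict;
    use warnings;

    sub worker { ... }   # the per-range routine above, plus an output handle (yada = placeholder)

    my $file  = 'infile.dat';                # placeholder name
    my $cores = 4;
    my $size  = -s $file;
    my $chunk = int( $size / $cores ) + 1;

    my @pids;
    for my $n ( 0 .. $cores - 1 ) {
        my $start = $n * $chunk;
        my $end   = $start + $chunk - 1;
        $end = $size - 1 if $end >= $size;

        defined( my $pid = fork() ) or die "fork failed: $!";
        if ( $pid == 0 ) {                                 # child
            open my $in,  '<', $file           or die $!;
            open my $out, '>', "$file.part0$n" or die $!;
            worker( $in, $out, $start, $end, $n );         # process its byte range
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;                               # wait for the kids

    open my $merged, '>', "$file.out" or die $!;           # merge the parts in order
    for my $n ( 0 .. $cores - 1 ) {
        open my $part, '<', "$file.part0$n" or die $!;
        print {$merged} $_ while <$part>;
    }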

    HTH.

    Update: If you have a second physical device available, do your small file writes to that; and then merge them back to the original device.


Re: Recommendations for efficient data reduction/substitution application
by kennethk (Abbot) on Mar 03, 2015 at 20:02 UTC
    There are micro-optimizations that you could implement here, but I would not choose to implement them without a substantial amount of profiling information. In general, you want to apply 100 regexes 10,000,000 times, so you've defined your grid.
    • $entry{$k} looks distinctly like you have a hash where you should have an array
    • s/$regex->{from}/$regex->{to}/g would seem to dereference 100*10,000,000 times when inverting the loops would mean you only have to dereference 100 times
    Nothing Earth shattering, really.
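
    A minimal sketch of the loop inversion from the second bullet, assuming the records for one batch are held in @lines and @conversions is the structure from the original post:

    # Dereference each conversion once, not once per line.
    for my $regex (@conversions) {
        my ( $from, $to ) = @{$regex}{qw(from to)};
        for my $line (@lines) {
            $regex->{count}++ if $line =~ s/$from/$to/g;
        }
    }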


Re: Recommendations for efficient data reduction/substitution application
by shmem (Chancellor) on Mar 03, 2015 at 20:16 UTC

    As a data sculptor, you have no choice but to chisel away unwanted chips. As a report-constructing artist, you assemble your work from a pool of material.

    That means it all depends on whether you know what you are looking for, or rather want to remove all the stuff you know about and don't care about, in order to "sieve out the problem gold".

    Maybe you can mix. Maybe at some stage of the process, extracting is cheaper, something like

    $entry{$k} =~ /($regex->{extract})/ and $entry{$k} = $1;

    - but it all depends on the input data and the aims of your reports.

    Then, in some forgotten corner of the olde perl operators, there's still study sitting, which might - or not - speed things up.

      Note that study is a no-op since 5.16.

        Thanks for the note. I'm lagging behind by... some years? Well. Time to get up to date.

Re: Recommendations for efficient data reduction/substitution application
by hdb (Monsignor) on Mar 03, 2015 at 19:26 UTC

    What I do not understand: is applying the substitutions itself the process that removes the noise, or do you first need to remove the noise and then apply the substitutions?

    And there is no way to extract the desired data from the records instead of removing the noise?

Re: Recommendations for efficient data reduction/substitution application
by atcroft (Abbot) on Mar 03, 2015 at 20:35 UTC

    (Roll-up of answers to the responses thus far, with my appreciation!)


    Re: hdb (Re: Recommendations for efficient data reduction/substitution application):
    Yes, the application of the substitutions is the process to remove the noise. The records may include errors that have not been seen in production before, so no, there is not a method I am aware of to extract the data from the records instead of removing the noise.


    Re: kennethk (Re: Recommendations for efficient data reduction/substitution application):
    At that point, I have broken the record up into parts in a hash called %entry (which includes other things such as the host logging the message, the time stamp, etc.). While I would love to be able to pull the entire data set into memory and run through the 100+ regexes one time, the combined size of the logs to process (several have exceeded 5GB in size so far) discourages the attempt. (Unless there is another way that has not come to mind yet.)


    Re: shmem (Re: Recommendations for efficient data reduction/substitution application):
    In this case, it becomes "I don't care about this, this, or that, but I need the rest of it." I had not thought about study(), however. I will look into that.


    Thank you all for your input and assistance.

      If you don't have backreferences in your expressions, the actions of the regular expressions are independent (as opposed to sequential), and your usage is sparse, you could precompile the world's ugliest regular expression:
      my $main_re = do {
          local $" = ')|)(?=.*?(';
          my @REs = map $_->{from}, @conversions;
          qr/^(?=.*?(@REs)|)/;
      };
      Called in array context, the result will be defined for all the indices where your regular expression matched. Thus allowing:
      my @scan = $entry{$k} =~ $main_re;
      for my $i (0..$#scan) {
          next unless defined $scan[$i];
          $entry{$k} =~ s/$conversions[$i]{from}/$conversions[$i]{to}/g;
          $conversions[$i]{count}++;
      }
      With backreferences, it's still possible but you'll have to map offsets. With dependencies between results, you could do something similar, but would have to map out Markov chains.

      Probably more suggestions to be had with more details on the system.



Re: Recommendations for efficient data reduction/substitution application
by ipherian (Novice) on Mar 04, 2015 at 09:36 UTC

    Even if you can't load the entire log into memory, loading it in chunks should speed things up.

    1. You can take the hash with the entries and join a few thousand of them into one buffer (or however much memory you want to use). Or you can grab a chunk of the log data from the file and then read on to the next newline with <IN> to complete the last line.

    2. Run the search/replace for each regex over the entire buffer, evaluating the s/foo/bar/g in list context. You can still find out how many replacements were done for each regex.


    Code:

    Get a chunk of data:

    my $bufflen = 4 * 1024;   # or w/e
    do {
        $result = read( IN, $buffer, $bufflen - length($buffer), length($buffer) );
    } while ( $result && ( length($buffer) < $bufflen ) );

    If the chunk ends in the middle of a line, strip off the remainder and save it for the next chunk: (not needed if each entry is separated beforehand)

    my $newline = "\n";   # or some other unique record separator
    ## if we're not at eof
    if ($result > 0) {
        my $last_newline = rindex $buffer, $newline;
        my $remainderlen = length($buffer) - $last_newline - length($newline);
        if ($remainderlen <= 0) {
            $remainder = '';
        }
        else {
            $remainder = substr($buffer, $last_newline + length($newline), $remainderlen, '');
        }
    }
    ## this is important: prefix the remainder before next chunk
    $buffer = $remainder;
    __EDIT__: Or instead of the above you could just do readline like BrowserUk did. D'oh!
    $buffer .= <IN>;
    Then apply your regexes: (and count how many replacements were done)
    foreach my $regex (@conversions) {
        my @results = ( $buffer =~ s/$regex->{from}/$regex->{to}/g );
        my $reps_done = 0;
        grep { $reps_done += $_ } @results;
        $regex->{count} += $reps_done;
    }
    ## and do whatever with the result
    print OUT $buffer;

    The above would be inside a block which loops over each chunk until the end of the log file is reached.

Re: Recommendations for efficient data reduction/substitution application
by wollmers (Scribe) on Mar 04, 2015 at 20:13 UTC

    I had a similar problem 10 years ago, i.e. analysing a weblog of 1+ GB/day by matching 3500 regex patterns against each line. With the hardware of the time (2005), a run was predicted to need more than 24 hours for one day's log.

    So I created what I called "reverse matching": take a substring of the log-line and match it against the patterns. If the patterns have a fixed part which can be anchored, a substring of width $w can be extracted from the log-line, and this substring is the key for a lookup in a hash of arrays of regex patterns. That pattern hash is preprocessed once, by extracting a substring of width $w at the same position of each regex pattern, if there is a literal string there in the pattern.

    The goal of the above algorithm is to reduce the O(n*p) complexity, where n is the number of lines and p the number of patterns. If a pattern contains a fixed string, we can use this substring to select (restrict, filter) the set of patterns to apply. In my case it was possible to reduce the number of patterns to try per line from 3500 to 120. Whether you can use a similar approach depends on the structure of your data and patterns.
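
    A rough sketch of that lookup; everything here is illustrative: the width $w, the fixed position, and the {literal} field assumed to hold each pattern's anchored fixed part (patterns without a usable literal would need a fallback list, omitted here):

    use strict;
    use warnings;

    my $w   = 6;    # width of the fixed key substring (illustrative)
    my $pos = 0;    # position of the literal part in both line and pattern

    # Illustrative patterns; {literal} holds the anchored fixed part of each regex.
    my @patterns = (
        { literal => 'ERROR:', regex => qr/^ERROR: disk (\S+) full/ },
        { literal => 'NOTICE', regex => qr/^NOTICE: job \d+ done/   },
    );

    # Preprocess once: bucket the patterns by the literal substring at $pos.
    my %patterns_by_key;
    for my $p (@patterns) {
        next unless length( $p->{literal} ) >= $pos + $w;
        push @{ $patterns_by_key{ substr $p->{literal}, $pos, $w } }, $p;
    }

    # Per line: try only the (much smaller) bucket that shares the key.
    while ( my $line = <STDIN> ) {
        my $key = substr $line, $pos, $w;
        for my $p ( @{ $patterns_by_key{$key} // [] } ) {
            $line =~ $p->{regex};    # or apply the corresponding substitution
        }
    }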

    As others noted here, there may be room for reduction at a higher level, because such a high count of substitution regexes smells unnecessary and goes against "Do only what you need." Maybe you can find some nice ideas in the excellent work of Tim Bunce: http://de.slideshare.net/Tim.Bunce/application-logging-in-the-21st-century-2014key

Re: Recommendations for efficient data reduction/substitution application
by LanX (Saint) on Mar 05, 2015 at 21:31 UTC
    You are trying 100+ substitutions on each entry, but how many do you really need on average?

    Are all substitutions potentially necessary for each log line or could you pre-classify them into groups?

    Are there log-lines which don't need any processing at all?
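
    For illustration only (the group names, classifier regex and sample conversions below are all made up), pre-classifying could look something like this:

    use strict;
    use warnings;

    # Dispatch each line to a small group of substitutions instead of
    # trying all 100+ on every line.
    my %conversions_by_group = (
        sshd    => [ { from => qr/port \d+/, to => 'port N' } ],
        default => [ { from => qr/\d+ ms/,   to => 'N ms'   } ],
    );

    while ( my $line = <STDIN> ) {
        my ($prog) = $line =~ /(\w+)(?:\[\d+\])?:/;        # crude classifier
        my $group  = ( defined $prog && $conversions_by_group{$prog} )
                   ? $prog : 'default';
        for my $regex ( @{ $conversions_by_group{$group} } ) {
            $regex->{count}++ if $line =~ s/$regex->{from}/$regex->{to}/g;
        }
    }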


Re: Recommendations for efficient data reduction/substitution application
by LanX (Saint) on Mar 05, 2015 at 11:33 UTC
    I solved a roughly similar problem using the trie optimization of the regex engine.

    You need to put the different match clauses into a long or-chain.

    The replace-part has to identify which regex matched.

    Furthermore, you should benchmark how expensive replacements are on longer chunks.

    A sliding window technique could be the answer.

    Hmm... Could you identify where the bottleneck really is by benchmarking match and replace independently?
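
    A minimal sketch of that or-chain, assuming the match parts can be combined with named captures to tell which one hit; the sample conversions are made up, and whether the trie optimization actually kicks in depends on the alternatives sharing literal prefixes:

    use strict;
    use warnings;

    # Made-up conversions; 'name' must be a valid identifier for the named capture.
    my @conversions = (
        { name => 'pid', from => qr/pid \d+/,     to => 'pid N'     },
        { name => 'mem', from => qr/free \d+ kB/, to => 'free N kB' },
    );
    my %by_name = map { $_->{name} => $_ } @conversions;
    my $any     = join '|', map { "(?<$_->{name}>$_->{from})" } @conversions;
    my $any_re  = qr/$any/;

    while ( my $line = <STDIN> ) {
        # one combined match per pass; %+ says which branch hit
        while ( $line =~ $any_re ) {
            my ($name) = keys %+;
            my $c      = $by_name{$name};
            # assumes the replacement text does not re-match its own pattern
            last unless $line =~ s/$c->{from}/$c->{to}/g;
            $c->{count}++;
        }
        print $line;
    }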


      Hm. Did they fix the "trie optimization breaks for more than 15000 alternative patterns" issue yet?

      C:\docs\OriginOfSpecies(Darwin)\2009-h>\perl5.18\bin\perl.exe \test\1043602.pl 2009-h.htm
      Finding 203474 words (of 216808) took 0.173504 seconds using a hash
      Finding 203474 words took 2072.099258 seconds using a trie (via regex engine)

