Foodeywo has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

The problem I need to solve is as follows:

I have an unknown number of optimized (assembled) regular expressions stored in separate files. Now I want to parse one big file and compare each line to all of those regular expressions. If a match is found, the line should be written to a file that corresponds to the regex. E.g. if I have 20 regexes, I should end up with 20 files. The reason I want this to happen simultaneously is that the big file is really big (~10GB), so I don't want to go through it 20 times (however, this might even be faster, as I fear the file handling within the while(<>) loop eats up all the performance).

My approach was to readdir and push all regex to an array, as well as all the filehandles.

    foreach(@inputs) {
        local *FILE;
        $file = "$FindBin::Bin/../rxo/$_";
        $outfile = "$FindBin::Bin/../blocks/$_";
        open(FILE, "$file") || die;
        open(OUTFILE, ">> $outfile") || die;
        $fh = \*FILE;
        $ofh = \*OUTFILE;
        $regex = <$fh>;
        push(@regex,$regex);
        push(@filehandles,$ofh);
    }

That looks ok, now the tricky part is to parse the big file, I tried the following:

    my $i = 0;
    while(<>) {
        foreach(@inputs) {
            print @filehandles[$i] if (/@regex[$i]/../^END_OF_BLOCK/);
            $i++;
        }
        $i=0;
    }

This has been running for 12 hours now but should take between 2 and 6 hours according to my calculations. Furthermore, it hasn't written anything to a file yet (maybe this doesn't happen until the script is done and the filehandles are closed). I think one problem is that the file handling inside while(<>), or the foreach(), slows down the whole thing. Also, I am not even sure my print line is correct. Note that I want to save not only the matching line but the complete block of lines, starting with the regex match and ending with an "end of block" line. Furthermore, I believe there are more elegant ways to solve the problem(s). Thanks in advance for any thoughts!

Update/Solution

There were several problems with my code, and with your help I was finally able to figure out all of them.

My final code looks like this:

    #!perl
    use strict;
    use warnings;
    use FindBin;

    my $dir ="$FindBin::Bin/../rxo";
    opendir(my $dh, $dir) || die "can't opendir $dir: $!";
    my @inputs = readdir($dh);
    closedir $dh;
    splice @inputs, 0, 2;

    my @dispatch;
    foreach(@inputs) {
        my $outfile = "$FindBin::Bin/../blocks/$_";
        #open(FILE, "$file") || die;
        # 'or' instead of '||': '||' would bind to $outfile, so die could never fire
        open my $ofh, '>', $outfile or die "can't open $outfile: $!";
        my $file = "$FindBin::Bin/../rxo/$_";
        open my $fh, '<', $file or die "can't open $file: $!";
        my $regex = <$fh>;
        close $fh;
        push @dispatch, { file => $ofh, regex => $regex };
    }

    # read one block (terminated by 'THE_END') per iteration
    while(my $line = do { local $/ = 'THE_END'; <> }) {
        foreach (@dispatch) {
            print { $_->{file} } $line if $line =~ $_->{regex};
            # also echo the matching block to STDOUT
            print $line if $line =~ $_->{regex};
        }
    }

This can be marked as solved. Thanks a lot to everyone!

Replies are listed 'Best First'.
Re: Write to multiple files according to multiple regex
by Laurent_R (Canon) on Jul 21, 2015 at 12:10 UTC
    Just a comment on one of your code lines:
    print @filehandles[$i] if (/@regex[$i]/../^END_OF_BLOCK/);
    First, at the very least, to refer to elements of an array, this should be:
    print $filehandles[$i] if (/$regex[$i]/../^END_OF_BLOCK/);
    Second, I do not really see the point of the ../^END_OF_BLOCK/ part in your context.

    Finally, I cannot test right now, but I don't think that:

    print $filehandles[$i] $_;
    is going to work properly. I think you probably need something like this:
    print {$filehandles[$i]} $_;
    Otherwise, some of the errors that you have would be picked up by the compiler if you had used the following pragmas:
    use strict; use warnings;
    at or near the top of your script file.
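
The two points above (element sigil, and braces around the filehandle) can be sketched as follows. This is only an illustration under assumptions: the in-memory buffers in @buffers, the two patterns and the sample lines are made up so that the snippet runs without touching any files.

```perl
use strict;
use warnings;

# Hypothetical setup: two in-memory "files" so the sketch needs no disk access.
my @buffers = ('', '');
my @filehandles;
for my $i (0 .. $#buffers) {
    open my $fh, '>', \$buffers[$i] or die "cannot open buffer $i: $!";
    push @filehandles, $fh;
}
my @regex = (qr/^foo/, qr/^bar/);

for my $line ("foo 1\n", "bar 2\n", "foo 3\n") {
    for my $i (0 .. $#regex) {
        # $filehandles[$i] (not @filehandles[$i]) is a single element;
        # the braces mark it as the handle to print to.
        print { $filehandles[$i] } $line if $line =~ $regex[$i];
    }
}
close $_ for @filehandles;
```

After the loop, $buffers[0] holds the two "foo" lines and $buffers[1] holds the "bar" line.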
Re: Write to multiple files according to multiple regex
by roboticus (Chancellor) on Jul 21, 2015 at 12:55 UTC

    Foodeywo:

    You don't show the script in one chunk, so I can't tell if you've got a logic error or not. But I hacked a quickie together and threw it at a large file (32GB), and it ran in a little over 3 minutes:

        #!/usr/bin/env perl
        #
        # search a large file for lines containing a regex
        #
        use strict;
        use warnings;
        use Data::Dump 'pp';

        my @rexlist;
        my $cnt=0;
        while (<DATA>) {
            next if /^\s*($|#)/;
            s/\s+$//;
            my ($name, $rex) = split /:/, $_;
            my $regex = qr/$rex/;
            ++$cnt;
            open my $FH, '>', "FILESRCH.$cnt" or die $!;
            push @rexlist, [ $regex, $name, $FH ];
        }

        open my $IFH, '<', "a_big_file" or die "$!";
        $cnt =0;
        my %cnts;
        my $lines=0;
        my $start = time;
        while (my $line = <$IFH>) {
            ++$cnt;
            ++$lines;
            if ($lines % 100000 == 0) {
                my $secs = time - $start;
                print "$lines: $secs s\n";
            }
            #last if $cnt>50;
            #print "$.: $line";
            my $matches = 0;
            for my $r (@rexlist) {
                my ($rex, $name, $OFH) = @$r;
                if ($line =~ $rex) {
                    print $OFH $line;
                    #print "match $matches ($name)\n";
                    ++$cnts{$name};
                }
                ++$matches;
            }
            #print "\n";
        }
        print pp(\%cnts);

        __DATA__
        aNumber:'\d+'
        CorporateRecord:'CORPORATE'
        null:NULL
        oldRec:'200[0-3]-\d\d-\d\d
        newRec:'20?[4-9]-\d\d-\d\d
        newRec2: '201\d-\d\d-\d\d

    I can only imagine that you have a logic error, or some particularly slow regexes to make your program run that slowly. The output from mine:

        $ time perl large_file_regex_search.pl
        100000: 1 s
        200000: 2 s
        300000: 3 s
        400000: 4 s
        500000: 5 s
        600000: 5 s
        700000: 6 s
        800000: 7 s
        900000: 8 s
        1000000: 10 s
        1100000: 15 s
        1200000: 18 s
        1300000: 20 s
        1400000: 23 s
        1500000: 25 s
        1600000: 29 s
        1700000: 35 s
        1800000: 42 s
        1900000: 47 s
        2000000: 53 s
        2100000: 60 s
        2200000: 66 s
        2300000: 71 s
        2400000: 75 s
        2500000: 81 s
        2600000: 87 s
        2700000: 92 s
        2800000: 98 s
        2900000: 103 s
        3000000: 107 s
        3100000: 113 s
        3200000: 119 s
        3300000: 124 s
        3400000: 129 s
        3500000: 135 s
        3600000: 142 s
        3700000: 151 s
        3800000: 158 s
        3900000: 166 s
        4000000: 173 s
        4100000: 181 s
        {
          aNumber => 4140847,
          CorporateRecord => 149943,
          newRec2 => 783275,
          null => 4140847,
          oldRec => 987898,
        }

        real 3m5.660s
        user 1m6.390s
        sys 0m16.875s
        $
        $ ls -al FI*
        -rw-r--r-- 1 Roboticus None 1261 May 30 12:12 FILES.ddl.sql
        -rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.1
        -rw-r--r-- 1 Roboticus None 116430098 Jul 21 08:47 FILESRCH.2
        -rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.3
        -rw-r--r-- 1 Roboticus None 769188466 Jul 21 08:47 FILESRCH.4
        -rw-r--r-- 1 Roboticus None 0 Jul 21 08:44 FILESRCH.5
        -rw-r--r-- 1 Roboticus None 613214364 Jul 21 08:47 FILESRCH.6

    At first, I thought that perhaps your ranges were too large and you were doing a lot of disk writing (which may be true), but two of the expressions in my list are on every input line, so FILESRCH.1 and FILESRCH.3 are exact copies of the input file. Post your entire script and some sample regexes so we can see where the difficulty lies.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks!

      The code is huge, and it's a little hard for me to understand every line. What I cannot figure out in your code is how $OFH can write to different files. It's not defined anywhere, is it?

      My code changed a bit after the many suggestions here and now looks like this:

          #!perl
          use strict;
          use warnings;
          use FindBin;

          my (@regex, $regex,$file,$outfile,$dir,$dh,@inputs,$inputs,@filehandles,$fh,$ofh);

          $dir ="$FindBin::Bin/../rxo";
          opendir($dh, $dir) || die "can't opendir $dir: $!";
          @inputs = readdir($dh);
          closedir $dh;
          splice @inputs, 0, 2;

          foreach(@inputs) {
              #localize the file glob, so FILE is unique to
              # the inner loop.
              local *FILE;
              local *OUTFILE;
              $file = "$FindBin::Bin/../rxo/$_";
              $outfile = "$FindBin::Bin/../blocks/$_";
              open(*FILE, "$file") || die;
              open(*OUTFILE, "> $outfile") || die;
              #push the typeglobe to the end of the array
              $fh = \*FILE;
              $ofh = \*OUTFILE;
              $regex = <$fh>;
              push(@regex,$regex);
              push(@filehandles,$ofh);
          }

          $/ = '^END$';
          while(my $line = <>) {
              for my $i(0..$#inputs) {
                  print {$filehandles[$i]} $line if $line =~ /$regex[$i]/;
              }
          }

      My regexes look like this:

      (?^:^UT A19(?:7(?:0G990800007|6CQ89200006)|8(?:0JW32900007|2PN88100001)|90DD63700001))

      Basically the data is arranged in blocks like:

      UT xxxxxx (some number), lets call this the entry
      some data about the entry
      some more data about the entry
      END
      UT xxxxx2 (next entry)
      ...

      So I want to 1) extract all blocks of interest, and 2) split these blocks into n files, since these blocks relate to n different regexes.

        Foodeywo:

        Regarding your question of how $OFH writes to different files: I do it by building an array whose entries each contain (1) the regular expression, (2) the name of the regular expression, and (3) the output file handle, using this code:

            while (<DATA>) {
                . . . create $name, $rex and $FH . . .
                push @rexlist, [ $regex, $name, $FH ];
            }

        Then as we process the input file, we scan through our regular expressions, and for each one, we pull the regex, name and output file handle out of the array:

            while (my $line = <$IFH>) {
                . . .
                # For each regular expression
                for my $r (@rexlist) {
                    # Pull the regular expression, name and file handle out of our array
                    my ($rex, $name, $OFH) = @$r;
                    # If the line matches the regex, write it to the file
                    if ($line =~ $rex) {
                        print $OFH $line;
                    }
                }
                . . .
            }

        Feel free to ask again if you need a bit more clarification.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        I can suggest several improvements to the code you have posted.

        Declare all variables in the smallest possible scope. Your declaration of all variables at the start of the file largely defeats your use of strict.

        Lexical file handles are much easier to manage than globs.

        The three argument form of open would make the intention clearer.

        Storing your file data in an array of hashes rather than in parallel arrays probably would not make any difference in speed, but it would help your readers by keeping related data together.

        Store your regexes as regexes (use qr//) rather than strings. It is probably faster, and it certainly makes the intention clearer.

        Note: The $INPUT_RECORD_SEPARATOR is a string, not a regex.

        UNTESTED

            #!perl
            use strict;
            use warnings;
            use FindBin;

            my $dir = "$FindBin::Bin/../rxo";
            opendir( my $dh, $dir ) || die "can't opendir $dir: $!";
            my @inputs = readdir($dh);
            closedir $dh;
            splice @inputs, 0, 2;

            my @dispatch;
            foreach (@inputs) {
                my $outfile = "$FindBin::Bin/../blocks/$_";
                # 'or' instead of '||': '||' would bind to $outfile, so die could never fire
                open my $ofh, '>', $outfile or die "can't open $outfile: $!";
                my $file = "$FindBin::Bin/../rxo/$_";
                open my $fh, '<', $file or die "can't open $file: $!";
                my $regex = <$fh>;
                close $fh;
                push @dispatch, { file => $ofh, regex => qr/$regex/ };
            }

            # read one block (terminated by the literal string 'END') per iteration
            while ( my $line = do{ local $/ = 'END'; <> } ) {
                foreach (@dispatch) {
                    print { $_->{file} } $line if $line =~ $_->{regex};
                }
            }
        Bill
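
A minimal sketch of the note about $INPUT_RECORD_SEPARATOR and of the qr// suggestion, assuming a made-up sample in the block layout described earlier in the thread. The helper read_blocks and the sample text are hypothetical and exist only for this illustration; the input is read from an in-memory string so that no files are needed.

```perl
use strict;
use warnings;

# Hypothetical sample data in the block layout described in the thread.
my $data = <<'END_SAMPLE';
UT A1970G990800007
some data
END
UT B000
other data
END
END_SAMPLE

# Split the input into records. $/ is a literal string, so "END\n" is
# compared character by character; '^' or '$' would not act as anchors.
sub read_blocks {
    my ($text) = @_;
    open my $in, '<', \$text or die "cannot open in-memory input: $!";
    local $/ = "END\n";
    my @blocks = <$in>;
    close $in;
    return @blocks;
}

my $regex  = qr/^UT A19/;    # compiled once with qr//
my @blocks = read_blocks($data);
my @hits   = grep { $_ =~ $regex } @blocks;
print scalar(@blocks), " blocks, ", scalar(@hits), " matching\n";
```

With this sample the script reports 2 blocks, 1 of them matching. The separator "END\n" also ends a record at any data line that merely ends in "END", so the sketch assumes no such line exists in the real data.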
Re: Write to multiple files according to multiple regex
by Monk::Thomas (Friar) on Jul 21, 2015 at 11:32 UTC

    This code is untested, but maybe it already works

        # 1. elide manually managed count variable
        # 2. read current line into an actual variable
        while(my $line = <$fh_bigfile>) {
            # 3. let perl manage the count variable
            for my $i (0..$#inputs) {
                # 4. wrong sigil, use $...[$i] instead of @...[$i]
                # 5. explicitly refer to the line to print it
                # 6. enclose file handle in braces to make it more obvious
                print {$filehandles[$i]} $line if (/$regex[$i]/../^END_OF_BLOCK/);
            }
        }

    The most important changes are 2., 4. and 5.

      Thanks! Referring explicitly to $line seems to be the key, since within foreach, print refers to the elements of @inputs by default.

      I also added $line =~ to make it work. The curly braces (# 6.) were also necessary.

      Now it writes all matches, but it puts all of them into the last filehandle only, throwing

      "Use of uninitialized value $_ in pattern match (m//) at parser.pl line 60, <> line 4401."

      for every line

        The conditional to print looked rather fishy, but I forgot about it. Try:

              print {$filehandles[$i]} $line
            -     if (/$regex[$i]/../^END_OF_BLOCK/);
            +     if ($line =~ /$regex[$i]/../^END_OF_BLOCK/);

        or did you already do exactly that? Please show your updated code.
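
One caveat with the flip-flop form, sketched here with made-up patterns and input: each `..` operator keeps a single state for the place where it appears in the source, not one state per regex, so inside a loop over several regexes they all share it. The sketch below is one possible alternative, not the code used in the thread: it tracks an "inside a block" flag per regex by hand. The arrays @inside and @out are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical patterns and input, only for illustration.
my @regex = (qr/^UT A/, qr/^UT B/);
my @lines = ("UT A1\n", "a-data\n", "END_OF_BLOCK\n",
             "UT B1\n", "b-data\n", "END_OF_BLOCK\n");

my @inside = (0) x @regex;     # one "inside a block" flag per regex
my @out    = ('') x @regex;    # collected output per regex

for my $line (@lines) {
    for my $i (0 .. $#regex) {
        # a block starts at the first line matching this regex
        $inside[$i] = 1 if !$inside[$i] && $line =~ $regex[$i];
        next unless $inside[$i];
        $out[$i] .= $line;
        # ... and ends with the terminator line, which is still included
        $inside[$i] = 0 if $line =~ /^END_OF_BLOCK/;
    }
}
print "regex $_:\n$out[$_]" for 0 .. $#regex;
```

Each regex collects only its own block, terminator line included. In a real script the `.=` would be replaced by a print to the filehandle that belongs to regex $i.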

Re: Write to multiple files according to multiple regex
by BillKSmith (Monsignor) on Jul 21, 2015 at 12:33 UTC
    If the format of your input file allows it, you could work with blocks instead of lines by setting $INPUT_RECORD_SEPARATOR ($/) to '^END_OF_BLOCK'.
    Bill

      Nice hint, thanks. I didn't know about that.

      I did this:

      $/ = '^END';

      and within the while

      print {$filehandles[$i]} $line if $line=~/$regex[$i]/;

      This seems to be much faster (I guess this is all it is about, right?).

      Last problem remaining is that everything is written to the last filehandle. I can't see why that is the case.

        Last problem remaining is that everything is written to the last filehandle. I can't see why that is the case.

        It looks to me like you are overwriting $ofh on each iteration of the first loop where you set up the array of handles. This would mean that every entry in your array points to the same file (the last file, of course). If you use a local variable inside that loop you may solve the problem. HTH.
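
The effect described above can be reproduced in a small sketch. It assumes in-memory buffers instead of real output files, and all variable names are made up for illustration. Taking \*OUTFILE in every iteration yields a reference to the same package glob each time, so every array entry ends up referring to whatever was opened last, while a lexical handle declared inside the loop is a new one per iteration.

```perl
use strict;
use warnings;

my @buffers = ('', '');    # hypothetical in-memory stand-ins for output files

# Variant 1: one package glob reused for every open.
my @glob_handles;
for my $i (0 .. $#buffers) {
    open(OUTFILE, '>', \$buffers[$i]) or die "open failed: $!";
    push @glob_handles, \*OUTFILE;    # reference to the same glob every time
}
my $same = $glob_handles[0] == $glob_handles[1] ? 1 : 0;
close OUTFILE;

# Variant 2: a fresh lexical handle per iteration.
my @lexical_handles;
for my $i (0 .. $#buffers) {
    open my $ofh, '>', \$buffers[$i] or die "open failed: $!";
    push @lexical_handles, $ofh;
}
my $distinct = $lexical_handles[0] != $lexical_handles[1] ? 1 : 0;
print { $lexical_handles[$_] } "record $_\n" for 0 .. $#lexical_handles;
close $_ for @lexical_handles;

print "glob refs identical: $same, lexical handles distinct: $distinct\n";
```

With lexical handles each buffer receives only its own record, which is the behaviour wanted for one output file per regex.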

        Aha, I just noticed that it does not work. Using the really big file gives me an "out of memory" error after a few seconds. Maybe I used it the wrong way?

        Update: it must be "END", not "^END". With "^" it just prints everything from the first match to the end of the file. However, if I use

        $/ = 'END';

        it prints the first match correctly (and, meanwhile, to the correct file), but it stops parsing afterwards.

Re: Write to multiple files according to multiple regex
by AnomalousMonk (Archbishop) on Jul 22, 2015 at 11:07 UTC
Re: Write to multiple files according to multiple regex
by hippo (Archbishop) on Jul 21, 2015 at 11:19 UTC

    You are not resetting the value of $i inside the while loop. It will therefore go on incrementing unchecked and extend far beyond the ends of your arrays. Move the initialisation of $i inside the loop to avoid this. Update: forget this - I missed the re-zero at the end :-/

    And try testing with a smaller data file for starters.

Re: Write to multiple files according to multiple regex
by BillKSmith (Monsignor) on Jul 23, 2015 at 14:18 UTC

    Thanks for the update.

    Note: Your original regex would work as intended under the /m option. (It changes the meaning of '^' from 'start-of-string' to 'start-of-line'.)

    Bill
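
A short sketch of the /m note, using a made-up multi-line record of the kind produced when $/ is set to a block terminator. The pattern and the sample text are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical multi-line record, as read with $/ set to a block terminator.
my $block = "some header line\nUT A1970G990800007\nsome data\nEND\n";

my $without_m = $block =~ /^UT A19/  ? 1 : 0;    # ^ matches at start of string only
my $with_m    = $block =~ /^UT A19/m ? 1 : 0;    # ^ matches at the start of any line

print "without /m: $without_m, with /m: $with_m\n";
```

Because the "UT" line is not the first line of the record, only the /m version matches.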