in reply to Write to multiple files according to multiple regex

Foodeywo:

You don't show the script in one chunk, so I can't tell whether you've got a logic error or not. But I hacked a quickie together, threw it at a large file (32GB), and it ran in a little over 3 minutes:

#!/usr/bin/env perl
#
# search a large file for lines containing a regex
#
use strict;
use warnings;
use Data::Dump 'pp';

my @rexlist;
my $cnt = 0;
while (<DATA>) {
    next if /^\s*($|#)/;
    s/\s+$//;
    my ($name, $rex) = split /:/, $_;
    my $regex = qr/$rex/;
    ++$cnt;
    open my $FH, '>', "FILESRCH.$cnt" or die $!;
    push @rexlist, [ $regex, $name, $FH ];
}

open my $IFH, '<', "a_big_file" or die "$!";
$cnt = 0;
my %cnts;
my $lines = 0;
my $start = time;
while (my $line = <$IFH>) {
    ++$cnt;
    ++$lines;
    if ($lines % 100000 == 0) {
        my $secs = time - $start;
        print "$lines: $secs s\n";
    }
    #last if $cnt > 50;
    #print "$.: $line";
    my $matches = 0;
    for my $r (@rexlist) {
        my ($rex, $name, $OFH) = @$r;
        if ($line =~ $rex) {
            print $OFH $line;
            #print "match $matches ($name)\n";
            ++$cnts{$name};
        }
        ++$matches;
    }
    #print "\n";
}
print pp(\%cnts);

__DATA__
aNumber:'\d+'
CorporateRecord:'CORPORATE'
null:NULL
oldRec:'200[0-3]-\d\d-\d\d
newRec:'20?[4-9]-\d\d-\d\d
newRec2: '201\d-\d\d-\d\d

I can only imagine that you have a logic error, or some particularly slow regexes to make your program run that slowly. The output from mine:

$ time perl large_file_regex_search.pl
100000: 1 s
200000: 2 s
300000: 3 s
400000: 4 s
500000: 5 s
600000: 5 s
700000: 6 s
800000: 7 s
900000: 8 s
1000000: 10 s
1100000: 15 s
1200000: 18 s
1300000: 20 s
1400000: 23 s
1500000: 25 s
1600000: 29 s
1700000: 35 s
1800000: 42 s
1900000: 47 s
2000000: 53 s
2100000: 60 s
2200000: 66 s
2300000: 71 s
2400000: 75 s
2500000: 81 s
2600000: 87 s
2700000: 92 s
2800000: 98 s
2900000: 103 s
3000000: 107 s
3100000: 113 s
3200000: 119 s
3300000: 124 s
3400000: 129 s
3500000: 135 s
3600000: 142 s
3700000: 151 s
3800000: 158 s
3900000: 166 s
4000000: 173 s
4100000: 181 s
{
  aNumber         => 4140847,
  CorporateRecord => 149943,
  newRec2         => 783275,
  null            => 4140847,
  oldRec          => 987898,
}

real    3m5.660s
user    1m6.390s
sys     0m16.875s
$
$ ls -al FI*
-rw-r--r-- 1 Roboticus None       1261 May 30 12:12 FILES.ddl.sql
-rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.1
-rw-r--r-- 1 Roboticus None  116430098 Jul 21 08:47 FILESRCH.2
-rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.3
-rw-r--r-- 1 Roboticus None  769188466 Jul 21 08:47 FILESRCH.4
-rw-r--r-- 1 Roboticus None          0 Jul 21 08:44 FILESRCH.5
-rw-r--r-- 1 Roboticus None  613214364 Jul 21 08:47 FILESRCH.6

At first, I thought that perhaps your ranges were too large and you were doing a lot of disk writing (which may be true), but two of the expressions in my list match every input line, so FILESRCH.1 and FILESRCH.3 are exact copies of the input file. Post your entire script and some sample regexes so we can see where the difficulty lies.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: Write to multiple files according to multiple regex
by Foodeywo (Novice) on Jul 21, 2015 at 13:30 UTC

    Thanks!

    The code is huge and a little hard for me to understand line by line. What I can't figure out in your code is how $OFH can write to different files. It's not defined anywhere, is it?

    My code changed a bit after the many suggestions here and now looks like that:

    #!perl
    use strict;
    use warnings;
    use FindBin;

    my (@regex, $regex, $file, $outfile, $dir, $dh, @inputs, $inputs, @filehandles, $fh, $ofh);

    $dir = "$FindBin::Bin/../rxo";
    opendir($dh, $dir) || die "can't opendir $dir: $!";
    @inputs = readdir($dh);
    closedir $dh;
    splice @inputs, 0, 2;

    foreach (@inputs) {
        # localize the file glob, so FILE is unique to
        # the inner loop.
        local *FILE;
        local *OUTFILE;
        $file    = "$FindBin::Bin/../rxo/$_";
        $outfile = "$FindBin::Bin/../blocks/$_";
        open(*FILE, "$file") || die;
        open(*OUTFILE, "> $outfile") || die;
        # push the typeglob to the end of the array
        $fh  = \*FILE;
        $ofh = \*OUTFILE;
        $regex = <$fh>;
        push(@regex, $regex);
        push(@filehandles, $ofh);
    }

    $/ = '^END$';
    while (my $line = <>) {
        for my $i (0 .. $#inputs) {
            print {$filehandles[$i]} $line if $line =~ /$regex[$i]/;
        }
    }

    My regexes look like this:

    (?^:^UT A19(?:7(?:0G990800007|6CQ89200006)|8(?:0JW32900007|2PN88100001)|90DD63700001))

    Basically the data is arranged in blocks like:

    UT xxxxxx (some number), lets call this the entry
    some data about the entry
    some more data about the entry
    END
    UT xxxxx2 (next entry)
    ...

    So I want to 1) extract all blocks of interest, and 2) split these blocks into n files, since the blocks relate to n different regexes.
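For the block-splitting part, here is a minimal sketch of one way to do it. The patterns are invented, and in-memory buffers stand in for the real output files:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read END-delimited blocks and copy each block to every output whose
# regex matches it. $/ must be the literal separator text, newline
# included, so each <> returns one whole block.
my ($buf_a, $buf_b) = ('', '');
my @dispatch = (
    { regex => qr/^UT A197/, buf => \$buf_a },   # invented patterns
    { regex => qr/^UT A198/, buf => \$buf_b },
);

my $data = "UT A1970\ndetails\nEND\nUT A1980\nmore\nEND\n";
open my $in, '<', \$data or die $!;              # in-memory handle for the demo

local $/ = "END\n";
while (my $block = <$in>) {
    for my $d (@dispatch) {
        ${ $d->{buf} } .= $block if $block =~ $d->{regex};
    }
}
print $buf_a;   # the A197... block, separator line included
```

With real files you would keep an open output handle in each dispatch entry instead of a buffer reference.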

      Foodeywo:

      Regarding your question about how $OFH writes to different files: I build an array containing (1) the name of the regular expression, (2) the regular expression, and (3) the output file handle, using this code:

      while (<DATA>) {
          . . . create $name, $rex and $FH . . .
          push @rexlist, [ $regex, $name, $FH ];
      }

      Then as we process the input file, we scan through our regular expressions, and for each one, we pull the regex, name and output file handle out of the array:

      while (my $line = <$IFH>) {
          . . .
          # For each regular expression
          for my $r (@rexlist) {
              # Pull the regular expression, name and file handle out of our array
              my ($rex, $name, $OFH) = @$r;
              # If the line matches the regex, write it to the file
              if ($line =~ $rex) {
                  print $OFH $line;
              }
          }
          . . .
      }

      Feel free to ask again if you need a bit more clarification.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      I can suggest several improvements to the code you have posted.

      Declare all variables in the smallest possible scope. Your declaration of all variables at the start of the file largely defeats your use of strict.
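For instance (invented file names), a variable declared in the loop header exists only inside that loop, so strict turns any accidental reuse into a compile-time error:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Declaring each variable where it is first needed keeps its lifetime
# short; touching it outside that scope fails to compile under strict,
# instead of silently reusing a stale value.
my @inputs = ('a.txt', 'b.txt');        # invented names
foreach my $name (@inputs) {            # $name exists only in this loop
    my $path = "rxo/$name";             # $path lives for one iteration
    print "$path\n";
}
# print $path;  # would not compile: Global symbol "$path" requires ...
```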

      Lexical file handles are much easier to manage than globs.
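For example (in-memory handles stand in for real output files here):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# A lexical handle is an ordinary scalar: it can be pushed onto an
# array and used later, and it closes automatically when the last
# reference to it disappears; no package-global *FILE globs, no local.
my ($buf_a, $buf_b) = ('', '');
my @handles;
for my $buf (\$buf_a, \$buf_b) {
    open my $fh, '>', $buf or die $!;
    push @handles, $fh;
}
print { $handles[0] } "first\n";
print { $handles[1] } "second\n";
close $_ for @handles;
print $buf_a, $buf_b;
```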

      The three argument form of open would make the intention clearer.

      Storing your file data in an array of hashes rather than in parallel arrays probably would not make any difference in speed, but it would help your readers by keeping related data together.

      Store your regexes as regexes (use qr//) rather than strings. It is probably faster, and it certainly makes the intention clearer.
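A small example of the difference (the pattern and sample line are invented):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# qr// compiles the pattern once; interpolating a plain string into
# m// recompiles it on every match. It also documents intent: $regex
# holds a regex, not text.
my $raw = "^UT A19\n";     # as it might come back from <$fh>
chomp $raw;                # a trailing newline would break the match
my $regex = qr/$raw/;

my $line = "UT A1970G990800007\n";
print "match\n" if $line =~ $regex;   # match
```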

      Note: The $INPUT_RECORD_SEPARATOR ($/) is a string, not a regex.
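A quick demonstration of the point (sample data invented): '^END$' is taken as eight literal characters, which never occur, so the whole input comes back as a single record.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $data = "UT A1\nEND\nUT A2\nEND\n";

# '^END$' is literal text, nothing matches, so we get 1 record.
open my $in, '<', \$data or die $!;
{
    local $/ = '^END$';
    my @records = <$in>;
    print scalar @records, " record(s) with '^END\$'\n";
}

# "END\n" is the actual separator text, so we get one record per block.
open $in, '<', \$data or die $!;
{
    local $/ = "END\n";
    my @records = <$in>;
    print scalar @records, " record(s) with \"END\\n\"\n";
}
```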

      UNTESTED

      #!perl
      use strict;
      use warnings;
      use FindBin;

      my $dir = "$FindBin::Bin/../rxo";
      opendir( my $dh, $dir ) || die "can't opendir $dir: $!";
      my @inputs = readdir($dh);
      closedir $dh;
      splice @inputs, 0, 2;

      my @dispatch;
      foreach (@inputs) {
          my $outfile = "$FindBin::Bin/../blocks/$_";
          open my $ofh, '>', $outfile or die $!;   # 'or', not '||': '||' binds to $outfile
          my $file = "$FindBin::Bin/../rxo/$_";
          open my $fh, '<', $file or die $!;
          my $regex = <$fh>;
          close $fh;
          chomp $regex;   # drop the trailing newline read from the file
          push @dispatch, { file => $ofh, regex => qr/$regex/ };
      }

      while ( my $line = do { local $/ = 'END'; <> } ) {
          foreach (@dispatch) {
              print { $_->{file} } $line if $line =~ $_->{regex};
          }
      }
      Bill
        Thank you very much! This runs and is much faster. However, I have problems with the $/. It stops after the first match is found, so I get 1 entry in 1 file, and the rest of the files remain empty.
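In case it is useful, a minimal sketch of the two things worth checking first when a read loop like this stops early; this is a guess at the cause, not a confirmed fix, and the pattern and data are invented. The pattern read with <$fh> keeps its trailing newline unless chomped, and $/ is simplest to set once, before the loop, to the literal block terminator including its newline:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $raw = "^UT A197\n";          # as it would come back from <$fh>
chomp $raw;                      # without this, qr/$raw/ rarely matches
my $regex = qr/$raw/m;

my $data = "UT A1970\nx\nEND\nUT A1980\ny\nEND\n";
open my $in, '<', \$data or die $!;   # in-memory stand-in for the input

local $/ = "END\n";              # set once, before the loop
my $matches = 0;
while (my $block = <$in>) {
    ++$matches if $block =~ $regex;
}
print "$matches matching block(s)\n";   # 1
```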