in reply to Write to multiple files according to multiple regex

Foodeywo:

You don't show the script in one chunk, so I can't tell whether you've got a logic error or not. But I hacked a quickie together, threw it at a large file (32GB), and it ran in a little over 3 minutes:

#!/usr/bin/env perl
#
# search a large file for lines containing a regex
#
use strict;
use warnings;
use Data::Dump 'pp';

my @rexlist;
my $cnt = 0;
while (<DATA>) {
    next if /^\s*($|#)/;
    s/\s+$//;
    my ($name, $rex) = split /:/, $_;
    my $regex = qr/$rex/;
    ++$cnt;
    open my $FH, '>', "FILESRCH.$cnt" or die $!;
    push @rexlist, [ $regex, $name, $FH ];
}

open my $IFH, '<', "a_big_file" or die "$!";
$cnt = 0;
my %cnts;
my $lines = 0;
my $start = time;
while (my $line = <$IFH>) {
    ++$cnt;
    ++$lines;
    if ($lines % 100000 == 0) {
        my $secs = time - $start;
        print "$lines: $secs s\n";
    }
    #last if $cnt > 50;
    #print "$.: $line";
    my $matches = 0;
    for my $r (@rexlist) {
        my ($rex, $name, $OFH) = @$r;
        if ($line =~ $rex) {
            print $OFH $line;
            #print "match $matches ($name)\n";
            ++$cnts{$name};
        }
        ++$matches;
    }
    #print "\n";
}
print pp(\%cnts);

__DATA__
aNumber:'\d+'
CorporateRecord:'CORPORATE'
null:NULL
oldRec:'200[0-3]-\d\d-\d\d
newRec:'20?[4-9]-\d\d-\d\d
newRec2: '201\d-\d\d-\d\d

I can only imagine that you have a logic error, or some particularly slow regexes to make your program run that slowly. The output from mine:

$ time perl large_file_regex_search.pl
100000: 1 s
200000: 2 s
300000: 3 s
400000: 4 s
500000: 5 s
600000: 5 s
700000: 6 s
800000: 7 s
900000: 8 s
1000000: 10 s
1100000: 15 s
1200000: 18 s
1300000: 20 s
1400000: 23 s
1500000: 25 s
1600000: 29 s
1700000: 35 s
1800000: 42 s
1900000: 47 s
2000000: 53 s
2100000: 60 s
2200000: 66 s
2300000: 71 s
2400000: 75 s
2500000: 81 s
2600000: 87 s
2700000: 92 s
2800000: 98 s
2900000: 103 s
3000000: 107 s
3100000: 113 s
3200000: 119 s
3300000: 124 s
3400000: 129 s
3500000: 135 s
3600000: 142 s
3700000: 151 s
3800000: 158 s
3900000: 166 s
4000000: 173 s
4100000: 181 s
{
  aNumber         => 4140847,
  CorporateRecord => 149943,
  newRec2         => 783275,
  null            => 4140847,
  oldRec          => 987898,
}

real    3m5.660s
user    1m6.390s
sys     0m16.875s
$
$ ls -al FI*
-rw-r--r-- 1 Roboticus None       1261 May 30 12:12 FILES.ddl.sql
-rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.1
-rw-r--r-- 1 Roboticus None  116430098 Jul 21 08:47 FILESRCH.2
-rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.3
-rw-r--r-- 1 Roboticus None  769188466 Jul 21 08:47 FILESRCH.4
-rw-r--r-- 1 Roboticus None          0 Jul 21 08:44 FILESRCH.5
-rw-r--r-- 1 Roboticus None  613214364 Jul 21 08:47 FILESRCH.6

At first, I thought that perhaps your ranges were too large and you were doing a lot of disk writing (which may be true), but two of the expressions in my list match every input line, so FILESRCH.1 and FILESRCH.3 are exact copies of the input file. Post your entire script and some sample regexes so we can see where the difficulty lies.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: Write to multiple files according to multiple regex
by Foodeywo (Novice) on Jul 21, 2015 at 13:30 UTC

    Thanks!

    The code is huge and a little hard for me to understand line by line. What I can't figure out in your code is how $OFH can write to different files. It's not defined anywhere, is it?

    My code changed a bit after the many suggestions here and now looks like that:

    #!perl
    use strict;
    use warnings;
    use FindBin;

    my (@regex, $regex, $file, $outfile, $dir, $dh, @inputs, $inputs, @filehandles, $fh, $ofh);

    $dir = "$FindBin::Bin/../rxo";
    opendir($dh, $dir) || die "can't opendir $dir: $!";
    @inputs = readdir($dh);
    closedir $dh;
    splice @inputs, 0, 2;

    foreach (@inputs) {
        # localize the file glob, so FILE is unique to
        # the inner loop.
        local *FILE;
        local *OUTFILE;
        $file    = "$FindBin::Bin/../rxo/$_";
        $outfile = "$FindBin::Bin/../blocks/$_";
        open(*FILE, "$file") || die;
        open(*OUTFILE, "> $outfile") || die;
        # push the typeglob to the end of the array
        $fh  = \*FILE;
        $ofh = \*OUTFILE;
        $regex = <$fh>;
        push(@regex, $regex);
        push(@filehandles, $ofh);
    }

    $/ = '^END$';
    while (my $line = <>) {
        for my $i (0 .. $#inputs) {
            print {$filehandles[$i]} $line if $line =~ /$regex[$i]/;
        }
    }

    My regexes look like this:

    (?^:^UT A19(?:7(?:0G990800007|6CQ89200006)|8(?:0JW32900007|2PN88100001)|90DD63700001))

    Basically the data is arranged in blocks like:

    UT xxxxxx (some number), lets call this the entry
    some data about the entry
    some more data about the entry
    END
    UT xxxxx2 (next entry)
    ...

    So I want to 1) extract all blocks of interest, and 2) split these blocks into n files, since the blocks relate to n different regexes.
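For the block-splitting part, here is a minimal sketch of one way to do it. The patterns are invented, and in-memory buffers stand in for the real output files:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read END-delimited blocks and copy each block to every output whose
# regex matches it. $/ must be the literal separator text, newline
# included, so each <> returns one whole block.
my ($buf_a, $buf_b) = ('', '');
my @dispatch = (
    { regex => qr/^UT A197/, buf => \$buf_a },   # invented patterns
    { regex => qr/^UT A198/, buf => \$buf_b },
);

my $data = "UT A1970\ndetails\nEND\nUT A1980\nmore\nEND\n";
open my $in, '<', \$data or die $!;              # in-memory handle for the demo

local $/ = "END\n";
while (my $block = <$in>) {
    for my $d (@dispatch) {
        ${ $d->{buf} } .= $block if $block =~ $d->{regex};
    }
}
print $buf_a;   # the A197... block, separator line included
```

With real files you would keep an open output handle in each dispatch entry instead of a buffer reference.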

      Foodeywo:

      Regarding your question about how $OFH writes to different files: I build an array containing (1) the name of the regular expression, (2) the regular expression, and (3) the output file handle, using this code:

      while (<DATA>) {
          . . . create $name, $rex and $FH . . .
          push @rexlist, [ $regex, $name, $FH ];
      }

      Then as we process the input file, we scan through our regular expressions, and for each one, we pull the regex, name and output file handle out of the array:

      while (my $line = <$IFH>) {
          . . .
          # For each regular expression
          for my $r (@rexlist) {
              # Pull the regular expression, name and file handle out of our array
              my ($rex, $name, $OFH) = @$r;
              # If the line matches the regex, write it to the file
              if ($line =~ $rex) {
                  print $OFH $line;
              }
          }
          . . .
      }

      Feel free to ask again if you need a bit more clarification.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      I can suggest several improvements to the code you have posted.

      Declare all variables in the smallest possible scope. Your declaration of all variables at the start of the file largely defeats your use of strict.
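For instance (invented file names), a variable declared in the loop header exists only inside that loop, so strict turns any accidental reuse into a compile-time error:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Declaring each variable where it is first needed keeps its lifetime
# short; touching it outside that scope fails to compile under strict,
# instead of silently reusing a stale value.
my @inputs = ('a.txt', 'b.txt');        # invented names
foreach my $name (@inputs) {            # $name exists only in this loop
    my $path = "rxo/$name";             # $path lives for one iteration
    print "$path\n";
}
# print $path;  # would not compile: Global symbol "$path" requires ...
```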

      Lexical file handles are much easier to manage than globs.
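For example (in-memory handles stand in for real output files here):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# A lexical handle is an ordinary scalar: it can be pushed onto an
# array and used later, and it closes automatically when the last
# reference to it disappears; no package-global *FILE globs, no local.
my ($buf_a, $buf_b) = ('', '');
my @handles;
for my $buf (\$buf_a, \$buf_b) {
    open my $fh, '>', $buf or die $!;
    push @handles, $fh;
}
print { $handles[0] } "first\n";
print { $handles[1] } "second\n";
close $_ for @handles;
print $buf_a, $buf_b;
```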

      The three argument form of open would make the intention clearer.

      Storing your file data in an array of hashes rather than in parallel arrays probably would not make any difference in speed, but it would help your readers by keeping related data together.

      Store your regexes as regexes (use qr//) rather than strings. It is probably faster, and it certainly makes the intention clearer.
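A small example of the difference (the pattern and sample line are invented):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# qr// compiles the pattern once; interpolating a plain string into
# m// recompiles it on every match. It also documents intent: $regex
# holds a regex, not text.
my $raw = "^UT A19\n";     # as it might come back from <$fh>
chomp $raw;                # a trailing newline would break the match
my $regex = qr/$raw/;

my $line = "UT A1970G990800007\n";
print "match\n" if $line =~ $regex;   # match
```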

      Note: The $INPUT_RECORD_SEPARATOR ($/) is a string, not a regex.
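A quick demonstration of the point (sample data invented): '^END$' is taken as eight literal characters, which never occur, so the whole input comes back as a single record.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $data = "UT A1\nEND\nUT A2\nEND\n";

# '^END$' is literal text, nothing matches, so we get 1 record.
open my $in, '<', \$data or die $!;
{
    local $/ = '^END$';
    my @records = <$in>;
    print scalar @records, " record(s) with '^END\$'\n";
}

# "END\n" is the actual separator text, so we get one record per block.
open $in, '<', \$data or die $!;
{
    local $/ = "END\n";
    my @records = <$in>;
    print scalar @records, " record(s) with \"END\\n\"\n";
}
```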

      UNTESTED

      #!perl
      use strict;
      use warnings;
      use FindBin;

      my $dir = "$FindBin::Bin/../rxo";
      opendir( my $dh, $dir ) || die "can't opendir $dir: $!";
      my @inputs = readdir($dh);
      closedir $dh;
      splice @inputs, 0, 2;

      my @dispatch;
      foreach (@inputs) {
          my $outfile = "$FindBin::Bin/../blocks/$_";
          open my $ofh, '>', $outfile or die $!;   # 'or', not '||': '||' binds to $outfile
          my $file = "$FindBin::Bin/../rxo/$_";
          open my $fh, '<', $file or die $!;
          my $regex = <$fh>;
          close $fh;
          chomp $regex;   # drop the trailing newline read from the file
          push @dispatch, { file => $ofh, regex => qr/$regex/ };
      }

      while ( my $line = do { local $/ = 'END'; <> } ) {
          foreach (@dispatch) {
              print { $_->{file} } $line if $line =~ $_->{regex};
          }
      }
      Bill
        Thank you very much! This runs and is much faster. However, I have problems with the $/. It stops after the first match is found, so I get 1 entry in 1 file, and the rest of the files remain empty.
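In case it is useful, a minimal sketch of the two things worth checking first when a read loop like this stops early; this is a guess at the cause, not a confirmed fix, and the pattern and data are invented. The pattern read with <$fh> keeps its trailing newline unless chomped, and $/ is simplest to set once, before the loop, to the literal block terminator including its newline:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $raw = "^UT A197\n";          # as it would come back from <$fh>
chomp $raw;                      # without this, qr/$raw/ rarely matches
my $regex = qr/$raw/m;

my $data = "UT A1970\nx\nEND\nUT A1980\ny\nEND\n";
open my $in, '<', \$data or die $!;   # in-memory stand-in for the input

local $/ = "END\n";              # set once, before the loop
my $matches = 0;
while (my $block = <$in>) {
    ++$matches if $block =~ $regex;
}
print "$matches matching block(s)\n";   # 1
```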