Re: Write to multiple files according to multiple regex
by Laurent_R (Canon) on Jul 21, 2015 at 12:10 UTC
Just a comment on one of your code lines:
print @filehandles[$i] if (/@regex[$i]/../^END_OF_BLOCK/);
First, at the very least, to refer to elements of an array, this should be:
print $filehandles[$i] if (/$regex[$i]/../^END_OF_BLOCK/);
Second, I do not really see the point of the ../^END_OF_BLOCK/ part in your context.
Finally, I cannot test right now, but I don't think that:
print $filehandles[$i] $_;
is going to work properly. I think you probably need something like this:
print {$filehandles[$i]} $_;
Otherwise, some of the errors that you have would be picked up by the compiler if you had used the following pragmas:
use strict;
use warnings;
at or near the top of your script file.
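As a minimal sketch of the point about `print {$filehandles[$i]} $_;`: when the filehandle comes from an array element rather than a plain scalar, print needs the block form to know which part is the handle. The in-memory buffers (`@buffers`) below are assumptions that stand in for real output files.

```perl
use strict;
use warnings;

# Hypothetical setup: two in-memory "files" stand in for real output files.
my @buffers = ('', '');
my @filehandles;
for my $i (0 .. $#buffers) {
    open my $fh, '>', \$buffers[$i] or die "cannot open buffer $i: $!";
    push @filehandles, $fh;
}

for my $i (0 .. $#filehandles) {
    # The block form tells print that the expression in braces is the handle.
    print { $filehandles[$i] } "line for handle $i\n";
}

close $_ for @filehandles;
print "buffer $_: $buffers[$_]" for 0 .. $#buffers;
```

A plain scalar such as `$fh` can be used without braces; anything more complex (an array element, a hash value) should go inside the block.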
Re: Write to multiple files according to multiple regex
by roboticus (Chancellor) on Jul 21, 2015 at 12:55 UTC
Foodeywo:
You don't show the script in one chunk, so I can't tell whether you've got a logic error. But I hacked a quickie together and threw it at a large file (32GB), and it ran in a little over 3 minutes:
#!/usr/bin/env perl
#
# search a large file for lines containing a regex
#
use strict;
use warnings;
use Data::Dump 'pp';
my @rexlist;
my $cnt=0;
while (<DATA>) {
    next if /^\s*($|#)/;
    s/\s+$//;
    my ($name, $rex) = split /:/, $_;
    my $regex = qr/$rex/;
    ++$cnt;
    open my $FH, '>', "FILESRCH.$cnt" or die $!;
    push @rexlist, [ $regex, $name, $FH ];
}
open my $IFH, '<', "a_big_file" or die "$!";
$cnt =0;
my %cnts;
my $lines=0;
my $start = time;
while (my $line = <$IFH>) {
    ++$cnt;
    ++$lines;
    if ($lines % 100000 == 0) {
        my $secs = time - $start;
        print "$lines: $secs s\n";
    }
    #last if $cnt>50;
    #print "$.: $line";
    my $matches = 0;
    for my $r (@rexlist) {
        my ($rex, $name, $OFH) = @$r;
        if ($line =~ $rex) {
            print $OFH $line;
            #print "match $matches ($name)\n";
            ++$cnts{$name};
        }
        ++$matches;
    }
    #print "\n";
}
print pp(\%cnts);
__DATA__
aNumber:'\d+'
CorporateRecord:'CORPORATE'
null:NULL
oldRec:'200[0-3]-\d\d-\d\d
newRec:'20?[4-9]-\d\d-\d\d
newRec2: '201\d-\d\d-\d\d
I can only imagine that you have a logic error, or some particularly slow regexes to make your program run that slowly. The output from mine:
$ time perl large_file_regex_search.pl
100000: 1 s
200000: 2 s
300000: 3 s
400000: 4 s
500000: 5 s
600000: 5 s
700000: 6 s
800000: 7 s
900000: 8 s
1000000: 10 s
1100000: 15 s
1200000: 18 s
1300000: 20 s
1400000: 23 s
1500000: 25 s
1600000: 29 s
1700000: 35 s
1800000: 42 s
1900000: 47 s
2000000: 53 s
2100000: 60 s
2200000: 66 s
2300000: 71 s
2400000: 75 s
2500000: 81 s
2600000: 87 s
2700000: 92 s
2800000: 98 s
2900000: 103 s
3000000: 107 s
3100000: 113 s
3200000: 119 s
3300000: 124 s
3400000: 129 s
3500000: 135 s
3600000: 142 s
3700000: 151 s
3800000: 158 s
3900000: 166 s
4000000: 173 s
4100000: 181 s
{
aNumber => 4140847,
CorporateRecord => 149943,
newRec2 => 783275,
null => 4140847,
oldRec => 987898,
}
real 3m5.660s
user 1m6.390s
sys 0m16.875s
$ ls -al FI*
-rw-r--r-- 1 Roboticus None 1261 May 30 12:12 FILES.ddl.sql
-rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.1
-rw-r--r-- 1 Roboticus None 116430098 Jul 21 08:47 FILESRCH.2
-rw-r--r-- 1 Roboticus None 3248770142 Jul 21 08:47 FILESRCH.3
-rw-r--r-- 1 Roboticus None 769188466 Jul 21 08:47 FILESRCH.4
-rw-r--r-- 1 Roboticus None 0 Jul 21 08:44 FILESRCH.5
-rw-r--r-- 1 Roboticus None 613214364 Jul 21 08:47 FILESRCH.6
At first, I thought that perhaps your ranges were too large and you were doing a lot of disk writing (which may be true), but two of the expressions in my list are on every input line, so FILESRCH.1 and FILESRCH.3 are exact copies of the input file. Post your entire script and some sample regexes so we can see where the difficulty lies.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Thanks!
The code is huge, and it is a little hard for me to understand every line. What I cannot figure out in your code is how $OFH can write to different files. It's not defined anywhere, is it?
My code changed a bit after the many suggestions here and now looks like this:
#!perl
use strict;
use warnings;
use FindBin;
my (@regex, $regex,$file,$outfile,$dir,$dh,@inputs,$inputs,@filehandles,$fh,$ofh);
$dir ="$FindBin::Bin/../rxo";
opendir($dh, $dir) || die "can't opendir $dir: $!";
@inputs = readdir($dh);
closedir $dh;
splice @inputs, 0, 2;
foreach(@inputs) {
    #localize the file glob, so FILE is unique to
    # the inner loop.
    local *FILE;
    local *OUTFILE;
    $file = "$FindBin::Bin/../rxo/$_";
    $outfile = "$FindBin::Bin/../blocks/$_";
    open(*FILE, "$file") || die;
    open(*OUTFILE, "> $outfile") || die;
    #push the typeglob to the end of the array
    $fh = \*FILE;
    $ofh = \*OUTFILE;
    $regex = <$fh>;
    push(@regex,$regex);
    push(@filehandles,$ofh);
}
$/ = '^END$';
while(my $line = <>) {
    for my $i(0..$#inputs) {
        print {$filehandles[$i]} $line if $line =~ /$regex[$i]/;
    }
}
My regexes look like this:
(?^:^UT A19(?:7(?:0G990800007|6CQ89200006)|8(?:0JW32900007|2PN88100001)|90DD63700001))
Basically the data is arranged in blocks like:
UT xxxxxx (some number), let's call this the entry
some data about the entry
some more data about the entry
END
UT xxxxx2 (next entry)
...
So I want to 1) extract all blocks of interest, and 2) split these blocks into n files, since they relate to n different regexes.
Foodeywo:
Regarding your question about how $OFH writes to different files: I build an array whose entries each hold (1) the regular expression, (2) its name, and (3) the output file handle, using this code:
while (<DATA>) {
    . . . create $name, $rex and $FH . . .
    push @rexlist, [ $regex, $name, $FH ];
}
Then as we process the input file, we scan through our regular expressions, and for each one, we pull the regex, name and output file handle out of the array:
while (my $line = <$IFH>) {
    . . .
    # For each regular expression
    for my $r (@rexlist) {
        # Pull the regular expression, name and file handle out of our array
        my ($rex, $name, $OFH) = @$r;
        # If the line matches the regex, write it to the file
        if ($line =~ $rex) {
            print $OFH $line;
        }
    }
    . . .
}
Feel free to ask again if you need a bit more clarification.
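For anyone who wants to try the idea in isolation, here is a small self-contained sketch of the same table of [ regex, name, file handle ] entries. The rule names, patterns, and sample lines are made up for illustration, and in-memory buffers (`%output`) stand in for the real output files.

```perl
use strict;
use warnings;

# Hypothetical rules; each table entry is [ compiled regex, name, handle ].
my %output;    # name => in-memory buffer standing in for an output file
my @rexlist;
for my $rule ( [ digits => '\d+' ], [ corporate => 'CORPORATE' ] ) {
    my ($name, $rex) = @$rule;
    $output{$name} = '';
    open my $FH, '>', \$output{$name} or die "cannot open buffer: $!";
    push @rexlist, [ qr/$rex/, $name, $FH ];
}

my @input = ("order 42\n", "CORPORATE account 7\n", "nothing here\n");
my %cnts;
for my $line (@input) {
    for my $r (@rexlist) {
        # Each pass unpacks a different entry, so $OFH is a different handle.
        my ($rex, $name, $OFH) = @$r;
        if ($line =~ $rex) {
            print $OFH $line;
            ++$cnts{$name};
        }
    }
}
close $_->[2] for @rexlist;

print "$_: $cnts{$_} line(s)\n" for sort keys %cnts;
```

The point is that $OFH is declared inside the inner loop and is assigned from whichever table entry is current, which is how one variable name ends up writing to several files.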
...roboticus
When your only tool is a hammer, all problems look like your thumb.
I can suggest several improvements to the code you have posted.
Declare all variables in the smallest possible scope. Your declaration of all variables at the start of the file largely defeats your use of strict.
Lexical file handles are much easier to manage than globs.
The three argument form of open would make the intention clearer.
Storing your file data in an array of hashes rather than in parallel arrays probably would not make any difference in speed, but it would help your readers by keeping related data together.
Store your regexes as regexes (use qr//) rather than strings. It is probably faster, and it certainly makes the intention clearer.
Note: The $INPUT_RECORD_SEPARATOR is a string not a regex.
UNTESTED
#!perl
use strict;
use warnings;
use FindBin;
my $dir = "$FindBin::Bin/../rxo";
opendir( my $dh, $dir ) || die "can't opendir $dir: $!";
my @inputs = readdir($dh);
closedir $dh;
splice @inputs, 0, 2;
my @dispatch;
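foreach (@inputs) {
    my $outfile = "$FindBin::Bin/../blocks/$_";
    # 'or', not '||': with '||' the die is attached to the file name,
    # not to the result of open, so a failed open would go unnoticed.
    open my $ofh, '>', $outfile or die "can't open $outfile: $!";
    my $file = "$FindBin::Bin/../rxo/$_";
    open my $fh, '<', $file or die "can't open $file: $!";
    my $regex = <$fh>;
    chomp $regex;    # drop the trailing newline before compiling
    close $fh;
    push @dispatch, { file => $ofh, regex => qr/$regex/ };
}
while ( my $line = do{ local $/ = 'END'; <> } ) {
    foreach (@dispatch) {
        print { $_->{file} } $line if $line =~ $_->{regex};
    }
}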
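To illustrate the note that $INPUT_RECORD_SEPARATOR is a string, here is a small sketch that reads block-structured data one record at a time. The sample data and the choice of "END\n" as the separator are assumptions based on the block layout described in this thread; an in-memory file stands in for the real input.

```perl
use strict;
use warnings;

# Hypothetical sample data in the block layout described in this thread.
my $data = "UT A1\nsome data\nEND\nUT A2\nmore data\nEND\n";
open my $in, '<', \$data or die "cannot open in-memory data: $!";

my @records;
{
    # $/ is a literal string: "END\n" ends each record after the END line.
    # A value such as '^END$' would be looked for character by character.
    local $/ = "END\n";
    while ( my $record = <$in> ) {
        push @records, $record;
    }
}
close $in;

printf "%d records\n", scalar @records;
print "every record starts with UT\n" unless grep { !/^UT / } @records;
```

Including the newline in the separator keeps each record starting at its UT line. One caveat: "END\n" would also split after any data line that happens to end in the letters END, so "\nEND\n" may be the safer choice if that can occur.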
Re: Write to multiple files according to multiple regex
by Monk::Thomas (Friar) on Jul 21, 2015 at 11:32 UTC
This code is untested, but maybe it already works
# 1. elide manually managed count variable
# 2. read current line into an actual variable
while(my $line = <$fh_bigfile>) {
    # 3. let perl manage the count variable
    for my $i (0..$#inputs) {
        # 4. wrong sigil, use $...[$i] instead of @...[$i]
        # 5. explicitly refer to the line to print it
        # 6. enclose file handle in braces to make it more obvious
        print {$filehandles[$i]} $line
            if (/$regex[$i]/../^END_OF_BLOCK/);
    }
}
The most important changes are 2., 4. and 5.
Thanks! Referring explicitly to $line is the key, it seems, since within foreach, print refers to the elements of @inputs by default.
I also added $line=~ to make it work. Curly braces (# 6.) were also necessary.
Now it writes all matches, but it puts them all into the last filehandle only, throwing
"Use of uninitialized value $_ in pattern match (m//) at parser.pl line 60, <> line 4401."
for every line.
print {$filehandles[$i]} $line
- if (/$regex[$i]/../^END_OF_BLOCK/);
+ if ($line =~ /$regex[$i]/../^END_OF_BLOCK/);
or did you already do exactly that? Please show your updated code.
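One thing worth checking with the range operator: each side of `..` is its own match, so a bare /^END_OF_BLOCK/ on the right is still matched against $_ even when the left side is bound to $line. That would be consistent with the "uninitialized value $_" warning. A minimal sketch with made-up input lines, binding both sides explicitly:

```perl
use strict;
use warnings;

# Hypothetical input lines; only the UT..END_OF_BLOCK range should be kept.
my @lines = ("header\n", "UT 1\n", "data\n", "END_OF_BLOCK\n", "trailer\n");
my @kept;
for my $line (@lines) {
    # Both operands are bound to $line. A bare /^END_OF_BLOCK/ would be
    # matched against $_, which this loop never sets.
    push @kept, $line if $line =~ /^UT/ .. $line =~ /^END_OF_BLOCK/;
}
print @kept;
```

Note also that each `..` in the source keeps a single on/off state. Inside a loop over several regexes that state is shared by all of them, so the range operator may not behave as expected there.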
Re: Write to multiple files according to multiple regex
by BillKSmith (Monsignor) on Jul 21, 2015 at 12:33 UTC
If the format of your input file allows it, you could work with blocks instead of lines by setting $INPUT_RECORD_SEPARATOR ($/) to '^END_OF_BLOCK'.
$/ = '^END';
and within the while
print {$filehandles[$i]} $line if $line=~/$regex[$i]/;
This seems to be much faster (I guess that is what this is all about, right?).
The last remaining problem is that everything is written to the last filehandle. I can't see why that is the case.
$/ = 'END';
it prints the first match correctly (and, by now, to the correct file), but it stops parsing afterwards.
Re: Write to multiple files according to multiple regex
by AnomalousMonk (Archbishop) on Jul 22, 2015 at 11:07 UTC
Re: Write to multiple files according to multiple regex
by hippo (Archbishop) on Jul 21, 2015 at 11:19 UTC
You are not resetting the value of $i inside the while loop. It will therefore go on incrementing unchecked and extend far beyond the ends of your arrays. Move the initialisation of $i inside the loop to avoid this. Update: forget this - I missed the re-zero at the end :-/
And try testing with a smaller data file for starters.
Re: Write to multiple files according to multiple regex
by BillKSmith (Monsignor) on Jul 23, 2015 at 14:18 UTC
Thanks for the update.
Note: Your original regex would work as intended under the /m option. (It changes the meaning of '^' from 'start-of-string' to 'start-of-line'.)
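A small sketch of what /m changes, assuming records are read with $/ = 'END' so that every record after the first begins with the newline left over from the previous END line. The sample record is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical record as it arrives when $/ = 'END': the newline that
# followed the previous END is left at the start of the next record.
my $record = "\nUT A2\nmore data\nEND";

my $plain     = $record =~ /^UT A2/  ? 1 : 0;   # ^ only at start of string
my $multiline = $record =~ /^UT A2/m ? 1 : 0;   # ^ after any newline too

print "without /m: $plain, with /m: $multiline\n";
```

One caveat, as far as I know: a pattern that was stringified from qr// carries its own flags in the (?^:...) wrapper, so adding /m on the outside does not reach a ^ inside it. In that case the flag would have to go inside the wrapper, as in (?^m:...).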