Re: how to parse large files
by Anno (Deacon) on Mar 29, 2007 at 12:30 UTC
The inner test should just be
if ( m/$item/ ) {
so that you run all the matches on the current line. That won't make it faster, but it will make it correct.
Further, you are re-opening the output file each time there is something to write. Open it once before the loop (simple write mode is fine, no need for append) and just use it inside the loop. That will make it faster, though there's no way of saying by how much.
Try that. If it's still too slow, go back to optimizing the regex work.
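Put together, that might look roughly like this (untested sketch; $fh and @items just stand in for whatever the real script uses):
open my $writeFh, '>', 'myout.log' or die "Can't open myout.log: $!";
while ( <$fh> ) {
    foreach my $item ( @items ) {    # @items: whatever holds the patterns
        if ( m/$item/ ) {            # the match runs against the current line in $_
            print $writeFh $_;
            last;                    # one print per matching line is enough
        }
    }
}
close $writeFh or die $!;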
Anno
Re: how to parse large files
by rhesa (Vicar) on Mar 29, 2007 at 12:20 UTC
Your bottleneck is likely still IO. Two tips:
- Don't open a new FileHandle for each match; instead, open one at the start and just write each matching line to it
- Try to eliminate the remaining foreach loop
Splitting the file up and processing it with multiple simultaneous processes isn't going to help much if you're reading off a single disk or a shared bus.
Re: how to parse large files
by Moron (Curate) on Mar 29, 2007 at 12:26 UTC
Looking at the second example, it looks like the regexps are being rebuilt from scratch on a per-use basis, in spite of there being only one call to build them all at the beginning.
The first example therefore seems a better basis to optimise from. And I would be inclined to blame its performance problems on the fact that you are getting a new output filehandle per line, which causes huge object proliferation. You only need to construct the output filehandle once, e.g.:
my %lookingFor;
# keys   => name of one subset
# values => reference to the array of items in that subset
my $fh      = new FileHandle "< largeLogFile.log";
my $writeFh = new FileHandle ">> myout.log";
while ( <$fh> ) {
    foreach my $subset ( keys %lookingFor ) {
        foreach my $item ( @{ $lookingFor{$subset} } ) {
            if ( m/$item/ ) {
                print $writeFh $_;
            }
        }
    }
}
close $fh or die $!;
close $writeFh or die $!;
Update: your code will also fail under "use strict", which should be placed at the beginning. To solve that, construct the hash so that its values are array references rather than array names, and the inner loop should change to foreach my $item ( @{ $lookingFor{$subset} } ) { as used in the example above.
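For illustration only (the subset names and domains below are made up), the hash might then be built along these lines:
my %lookingFor = (
    mailhosts => [ 'example\.com', 'example\.org' ],   # placeholder subset => array ref of patterns
    partners  => [ 'example\.net' ],
);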
Re: how to parse large files
by dk (Chaplain) on Mar 29, 2007 at 13:04 UTC
You're almost there with buildRegexp, but you can speed it up further by removing the sub call:
sub build_regexp {
    # join all the subsets into one alternation and compile it once
    my $expr = join '|', map { '(?:To|is):\S+@' . $_ } @_;
    eval { $expr = qr/$expr/i };
    die "aaa!! $@ !!!" if $@;
    return $expr;
}
(note that I'd suggest changing () grouping into (?:) grouping if you're not using $1).
then, all you need is
my $writeFh = new FileHandle ">> myout.log";
my $is_subset = build_regexp(@allSubSets);
while (<$fh>) {
    next unless /$is_subset/;
    print $writeFh <$fh>;
}
oh. print $writeFh $_; of course.
Re: how to parse large files
by johngg (Canon) on Mar 29, 2007 at 13:07 UTC
If at all possible I would try to go through the combinations of items once at the beginning of the script, taking the process out of the while ( <$fh> ) { ... } loop, building a single regex to match any wanted line. Something like
my @items = ();
foreach my $subset ( keys %lookingFor )
{
    foreach my $item ( @{ $lookingFor{$subset} } )
    {
        push @items, $item;
    }
}
my $rxMatchItems;
{
    local $" = q{|};
    $rxMatchItems = qr{(?:@items)};
}
As others have pointed out, open the output file once at the beginning of the script; the remainder of the code could then be reduced to something like
while ( <$fh> )
{
    next unless m{$rxMatchItems};
    print $writeFh $_;
}
This approach all depends on whether your items can easily be combined into one regex, but I think it might be more efficient if feasible.
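If the items could contain regex metacharacters, a variation (just a sketch, not something from the original suggestion) would be to quotemeta them while joining:
my $rxMatchItems = do {
    my $alternation = join q{|}, map { quotemeta } @items;
    qr{(?:$alternation)};
};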
I hope this is of use. Cheers, JohnGG
Re: how to parse large files
by roboticus (Chancellor) on Mar 29, 2007 at 18:06 UTC
domcyrus:
In addition to all the other notes you have, you might want to change the structure of your loop a bit. Rather than searching over all the values in the hash, make an inverted hash and then just do a single lookup for each word in the line. That way, you needn't execute nested loops.
Specifically:
#!/usr/bin/perl -w
use strict;
use warnings;

my %lookingFor = (
    'lazy'  => ['lazy', 'tired'],
    'entry' => ['entry', 'opening', 'ingress'],
    'file'  => ['file', 'files', 'filehandle'],
    'such'  => ['such'],
);

# Build an inverted hash with pointers from the individual items
# to the matching key in lookingFor
my %revLUP;
for my $k (keys %lookingFor) {
    for my $v (@{$lookingFor{$k}}) {
        $revLUP{$v} = $k;
    }
}

while (my $buf = <DATA>) {
    # Print line if it has a 'magic word' in it
    print $buf if grep { defined $revLUP{$_} } split /\s+/, $buf;
}
__DATA__
Now is the time for all good men to come to the
aid of their party. The quick red fox jumped
over the lazy brown dog.
[tye]: yes, on Window or on Unix,
the old file is still open so it is just
its directory entry that gets clobbered
[bart]: On Linux, you can unlink a
file and the processes that have the
file open, will still see the contents. I
suspect the same happens here.
[tye]: "busy" only seems to apply to
executable files, talexb. no problem
deleting files that are open (though
Win32 C RTL /defaults/ to locking
the file such that this is prevented)
[blokhead]: in short, the filehandle
is tied to an inode, not a filename
[bart]: Meaning, the directory points
to the new contents, and the old
contents is unlinked (but visible). Is
that correct?
which yields:
$ ./bigfile.pl
over the lazy brown dog.
the old file is still open so it is just
its directory entry that gets clobbered
file and the processes that have the
file open, will still see the contents. I
deleting files that are open (though
the file such that this is prevented)
[blokhead]: in short, the filehandle
$
I don't know if this method will save you any time or not, as I haven't done any benchmarking. In any case, it may have a lot to do with the number of items in %lookingFor, the performance of grep, etc. But if this helps at all, you can then look for further speedups.
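If grep turns out to be a cost, one variation (again unbenchmarked) is a loop that stops at the first matching word:
while (my $buf = <DATA>) {
    for my $word (split /\s+/, $buf) {
        if (exists $revLUP{$word}) {
            print $buf;
            last;    # stop at the first 'magic word'
        }
    }
}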
--roboticus
Re: how to parse large files
by Krambambuli (Curate) on Mar 30, 2007 at 08:32 UTC
I'm a bit surprised that none of the answers so far have mentioned or asked about Devel::DProf. So: have you tried to run your script under the profiler?
In my experience, it can come as a surprise to see where most of the time is actually spent.
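For example, assuming the script is called parselog.pl:
perl -d:DProf parselog.pl    # writes profiling data to tmon.out
dprofpp                      # reads tmon.out and shows where the time went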
Re: how to parse large files
by leocharre (Priest) on Mar 29, 2007 at 13:16 UTC
I've read something in the past that really comes to mind here: "perl iterators". It's a way of coding that handles a large task one piece at a time. I suggest looking into this; it's the first hit on Google.
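For what it's worth, an "iterator" in Perl is usually just a closure that hands back one piece of work per call. A minimal sketch (the file name is only an example):
use strict;
use warnings;

# Return a closure that yields one line per call, undef at end of file.
sub make_line_iterator {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    return sub { return scalar <$fh> };
}

my $next_line = make_line_iterator('largeLogFile.log');
while ( defined( my $line = $next_line->() ) ) {
    # ... run the pattern matches against $line here ...
}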
Thanks a lot for all of your answers. First of all the code I've posted was just some pseudo code. I didn't want you to read too much. So in my script I am not opening a filehandle all the time. Also the IO wait is very low, so it is really just the CPU which is doing a lot of work. It seems that I
I will have a look now at the iterators thing, which looks a little bit confusing to me as I am not that familiar with Perl at all.
Cheers
-Marco
Thanks a lot for all of your answers.
You're welcome! But then, please see below.
First of all the code I've posted was just some pseudo code. I didn't want you to read too much.
That is very kind of you, but going the pseudo code route is generally not the best thing to do: one is always advised to prepare minimal but complete examples instead.
So in my script I am not opening a filehandle all the time
So what? However minimal or "pseudo" you want to stay, putting the
my $writeFh = new FileHandle ">> myout.log";
line inside or outside of your loop takes exactly the same amount of space, and of keystrokes. So why not put it where it belongs in the first place?
Also the IO wait is very low, so it is really just the CPU which is doing a lot of work.
Granted, your code has several bigger problems, e.g. the if (<$fh> =~ m/$item/) thingy, but are you sure? How can you tell?
I will have a look now at the iterators thing, which looks a little bit confusing to me as I am not that familiar with Perl at all.
As I wrote above, you're welcome. But why didn't you mention that you had already asked the very same question (link @ GG) in clpmisc? Why didn't you mention there that you also asked here? What did they tell you there? Why didn't you report back?
For completeness, I'm pasting all the messages of the thread hereafter:
Re: how to parse large files
by newroz (Monk) on Mar 30, 2007 at 10:35 UTC
Hi,
The Tie::File module can be used to access the gigantic file on disk, instead of having to split the file into chunks and slurp them into memory.
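A minimal sketch of what that might look like (read-only mode assumed):
use Tie::File;
use Fcntl 'O_RDONLY';

tie my @log, 'Tie::File', 'largeLogFile.log', mode => O_RDONLY
    or die "Can't tie largeLogFile.log: $!";

for my $line ( @log ) {
    # match $line against the patterns here
}

untie @log;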
Re: how to parse large files
by kingkongrevenge (Scribe) on Mar 31, 2007 at 15:16 UTC
Use compiled regular expressions as the hash keys. This stops you from rebuilding the regex on every iteration. Alternatively, memoize buildRegexp. Or Tie::RegexpHash might be useful.
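If you go the memoization route, something along these lines should do it (assuming buildRegexp is the named sub from the earlier posts):
use Memoize;
memoize('buildRegexp');    # repeated calls with the same arguments return the cached result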
You can eliminate a for loop and the long ||'ed regexp by changing the data structure. Don't group the subset names in an array to be used as the key. Put each subset name in the hash as a single key but have the value of the hash as a reference to the array. Equivalent subset names point to the same array.
my %lookingFor;
my $ref = [1, 2, 3];
$lookingFor{qr/a.*blah/o} = $ref;
$lookingFor{qr/the.*blee/o} = $ref;
#Tie::RegexpHash would let you do $lookingFor{$bigfileline}
#without looping over keys() at all. No idea if it's faster.
The bottleneck is probably IO, but if it's not, a more sophisticated data structure might help. You're not using the hash as a hash, but really as an array of arrays, i.e. an associative list. Is %lookingFor really big? If it is, you might figure out a hierarchy and make it into a tree so you don't have to search linearly.