Re: Efficiently parsing a large file
by matija (Priest) on Apr 08, 2004 at 20:24 UTC
I think you should create a hash that will hold state:
if (m/(\w{6}-\w{5})\s+begin/) {
    $state{$1} = 'begin';
} elsif (m/(\w{6}-\w{5})\s+doing-work/) {
    if ($state{$1} eq 'begin') {
        $state{$1} = 'doing';
    } else {
        warn "$1: doing work without begin\n";
    }
} elsif (m/(\w{6}-\w{5})\s+complete/) {
    if ($state{$1} eq 'doing') {
        $state{$1} = 'complete';
    } else {
        warn "$1: got complete, but state=\"$state{$1}\"\n";
    }
}
Then at the end, you just go through the hash, and print all entries that aren't in the state complete.
That way, you only need to go through the file once.
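A minimal sketch of that final pass over the %state hash built above:

# After the input loop: report every serial that never reached 'complete'.
for my $serial (sort keys %state) {
    print "$serial stalled in state '$state{$serial}'\n"
        if $state{$serial} ne 'complete';
}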
If the file is really huge (several hundred MB) it is important
- to read the file only once
- to keep memory allocation small
I would therefore suggest deleting the hash item once it reaches "complete":
delete $state{$1};
Looping through the hash after reading the whole file would then show just the non-completed cases.
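In matija's loop above, the 'complete' branch would then look roughly like this (a sketch, untested):

} elsif (m/(\w{6}-\w{5})\s+complete/) {
    if ($state{$1} eq 'doing') {
        delete $state{$1};    # finished cleanly; drop it to keep the hash small
    } else {
        warn "$1: got complete, but state=\"$state{$1}\"\n";
    }
}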
Hi Pelagic,
Unless I'm mistaken, I believe he wants to keep the complete items: "For each line containing serial number and begin I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it." Then again, I might be wrong ;-)
Jason L. Froebe
No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1
I agree with Neil: Very clever
I am a bit nervous about putting it into a hash that isn't tied to a file because of the amount of memory involved (hundreds of megs for the file). I recommend tying the hash to a file (gdbm or similar) so you can still access it without eating up all the memory on the box.
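A minimal sketch of what that could look like, assuming GDBM_File is available (the filename is made up; any DBM module with a tie interface would do):

use GDBM_File;

# Keep the state hash on disk instead of in RAM.
tie my %state, 'GDBM_File', 'state.db', &GDBM_WRCREAT, 0640
    or die "Cannot tie state.db: $!";

# ... build %state line by line exactly as above ...

untie %state;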
hope this helps
Jason L. Froebe
No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1
I think that the decision to tie the hash or not depends not upon the size of the file that is being read but more upon what percentage of the file being read has meaningful entries.
If the majority of the entries in the file being read are just junk that can be ignored, then the state-hash can probably be maintained in memory w/o tying.
If the data source is very rich, though, then it would be wise to tie the state-hash to another file and manage the entries there.
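For a rough, purely illustrative estimate: if a 500 MB log averages 50 bytes per line but only a few percent of lines carry one of the three tokens, the state-hash holds at most a few hundred thousand serials; even at well over 100 bytes of Perl hash overhead per key, that is tens of MB and fits in memory comfortably. If nearly every line is meaningful, the same arithmetic lands in the hundreds of MB, and that is where tying to a DBM file starts to pay off.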
Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Efficiently parsing a large file
by DamnDirtyApe (Curate) on Apr 08, 2004 at 20:12 UTC
Have you considered using Tie::File? I don't honestly know if it will be more efficient or not, but it might make for a more sensible implementation.
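For instance (a sketch only; the filename is assumed, and whether it actually wins on a huge file is the open question):

use Fcntl 'O_RDONLY';
use Tie::File;

# Treat the log as an array without slurping the whole file into memory.
tie my @log, 'Tie::File', 'logfile', mode => O_RDONLY
    or die "Cannot tie logfile: $!";

for my $line (@log) {
    # ... same per-serial state tracking as in the other replies ...
}

untie @log;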
_______________
DamnDirtyApe
Those who know that they are profound strive for clarity. Those who
would like to seem profound to the crowd strive for obscurity.
--Friedrich Nietzsche
Re: Efficiently parsing a large file
by jeffa (Bishop) on Apr 08, 2004 at 20:14 UTC
Breezing through after being frustrated from work, but having to return to the grind ...
how about forking? Every time you encounter a 'begin', fork off a process and have it continue
to search for the next two tokens.
Just a thought. :)
Re: Efficiently parsing a large file
by tachyon (Chancellor) on Apr 09, 2004 at 03:13 UTC
How about using a bitmap state hash, deleting as you go to limit memory footprint. Single pass.....
my %states = (
    'begin'      => 1,
    'doing-work' => 2,
    'complete'   => 4,
);
my $re = join '|', keys %states;
$re = qr/^(\S+)\s*($re)/;

my %fail_decode = (
    6 => 'no begin',
    5 => 'no doing work',
    4 => 'no begin or doing work',
    3 => 'no complete',
    2 => 'no begin or complete',
    1 => 'no doing work or complete',
);

my %status;
while (<DATA>) {
    next unless /$re/;
    $status{$1} += $states{$2};
    delete $status{$1} if $status{$1} == 7;
}

printf "$_\t%s\n", $fail_decode{$status{$_}} for keys %status;
__DATA__
foo begin
bar begin
baz complete
foo doing-work
foo complete
bar doing-work
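For the sample DATA above this should report something like (key order will vary):

bar     no complete
baz     no begin or doing work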
Re: Efficiently parsing a large file
by pizza_milkshake (Monk) on Apr 09, 2004 at 02:34 UTC
yeah, how about not looping through the file more than once for starters.
#!perl -wl
use strict;

# bit fields
my %status = qw(
    begin      1
    doing-work 2
    complete   4
);

# figure out what a "perfect" score is
my $ok = (2 ** (scalar keys %status)) - 1;

# where we keep stuff
my %entries;

#open F, "<logfile" or die $!;
while (<DATA>) {
    if (/^(\d{4}HA-\d{5}) (\S+)$/) {
        $entries{$1} |= $status{$2};
    }
}
#close F;

# filter out entries that passed
for (grep { $entries{$_} == $ok } keys %entries) {
    delete $entries{$_};
}

printf(
    "%d entries failed\n",
    scalar keys %entries
);

for (sort keys %entries) {
    print "$_: $entries{$_}";
}
__DATA__
1234HA-00001 begin
1234HA-00002 begin
1234HA-00003 begin
1234HA-00001 doing-work
1234HA-00002 doing-work
1234HA-00001 complete
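For the DATA above this prints:

2 entries failed
1234HA-00002: 3
1234HA-00003: 1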
perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'
Re: Efficiently parsing a large file
by bluto (Curate) on Apr 08, 2004 at 21:18 UTC
Using a hash as mentioned earlier is probably easiest, though it can get slow with huge numbers of elements -- even if you have to resort to tying a hash to a file. One low-tech way, if you are using some unix variant and have extra disk, may be to use the unix sort utility before you pass through the file with perl. Then all of your entries should appear in order, or at least close together (though you may have to play some games with sort's options to get the exact order you want).
I've used sort on files in the GB range with millions of lines of text. Note: one good sanity check, in addition to checking for errors from sort, is to make sure the sorted file is the same size as the input file.
Update: Took out references to "GNU" since I'm not sure I've specifically used it on files that large (though it might work fine).
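A rough idea of what that could look like from the perl side (the key field -k1,1, the temp directory, and the filename are all guesses, adjust to taste):

# Let the external sort group lines by serial number, then read its output.
open my $in, '-|', 'sort', '-k1,1', '-T', '/some/spare/disk', 'logfile'
    or die "Cannot start sort: $!";

while (<$in>) {
    # ... all lines for one serial now arrive together, so the state
    #     for the previous serial can be flushed as soon as the serial changes ...
}

close $in or die "sort failed or was interrupted: $!";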
sort will create separate temp files anyway which it then merges together and pipes to you. You really aren't gaining much since you need almost as much disk space. Also, pipes can slow down IO for huge files. They are usually limited to a few KB of buffer so you get a lot of swapping (i.e. sort feeds some bytes to you then goes to sleep, you wake up and process them then go to sleep, then sort feeds some more, etc.).
Sorting with good tools can be very efficient, but in this case you have to scan through the whole (sorted or unsorted) file once anyway to find the (un-)complete sets. As the file is nearly sorted already, it's probably most efficient not to sort it explicitly.
My impression from the original post was that lines were sorted within a serial number, but there could be many lines of other serial numbers interspersed in between. If this is wrong, then you are correct.
Re: Efficiently parsing a large file
by Anonymous Monk on Apr 09, 2004 at 06:12 UTC
This should do the trick.
my %idx;
open FH, "logfile" or die "can't open logfile";
while (<FH>) {
    # skip lines that don't look like a log entry
    my ($ser, $msg) = $_ =~ /^(.+?-.+?)\s+(\S+)/ or next;
    $idx{$ser}->{$msg}++;
    if ($idx{$ser}->{'begin'} &&
        $idx{$ser}->{'doing-work'} &&
        $idx{$ser}->{'complete'} ) {
        # found all the stuff, remove the entry.
        delete $idx{$ser};
    }
}
print "didn't complete:\n";
foreach my $ser (keys %idx) {
    print "\t$ser\n";
}
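With input like pizza_milkshake's sample data above, the report would come out roughly as (serial order may vary):

didn't complete:
        1234HA-00002
        1234HA-00003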
Re: Efficiently parsing a large file
by ttg (Initiate) on Apr 13, 2004 at 01:27 UTC
Neil,
Assuming some flavor of Unix... if not, disregard.
You could pre-process the file with the uniq command, which should collapse any adjacent duplicate entries in the log file into one line. That might significantly reduce the number of lines you need to process. Then build a hash of your opens with a value of 2, subtract 1 when you see the matching working entry (new value in the hash is 1), and subtract 1 again when you process the corresponding close (value drops to 0). That way you would know what was opened, worked on, and closed, and it could be a one-pass-through-the-file thing.
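A quick sketch of that counting scheme (untested; token names borrowed from the earlier replies, and the filehandle is assumed to be open already):

my %pending;
while (<FH>) {
    my ($ser, $msg) = /^(\S+)\s+(\S+)/ or next;
    if    ($msg eq 'begin')      { $pending{$ser} = 2 }
    elsif ($msg eq 'doing-work') { $pending{$ser}-- if exists $pending{$ser} }
    elsif ($msg eq 'complete')   { $pending{$ser}-- if exists $pending{$ser} }
}

# Anything still above zero was opened but never fully worked and closed.
print "$_ stuck at count $pending{$_}\n"
    for grep { $pending{$_} > 0 } keys %pending;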
Just a thought.
tim