Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
I need to check 60,000+ bzipped files, about 15 of which I know are faulty, to find out which ones are broken. I also know how many lines each file should have, and I need to verify that count for every file.
I've written a script for this in which I tried to adapt the bzcat example from Compress::Bzip2 and combine it with code for forking from http://hell.jedicoder.net/?p=82 (the machine I work on supports running eight processes concurrently). Unfortunately it doesn't work the way I expected it to.
Basically, it tells me that almost every file is broken ("Error reading from <file>" for each file), prints blank lines for a few of them, and reports no file with the wrong number of lines, although I know there are some.
Can someone help? I thought it might have to do with forking and variable names, maybe...
    use strict;
    use warnings;
    use Compress::Bzip2;

    my $error_file = "errors.txt";
    open ERRORS, ">$error_file";
    my @files = glob("$path");
    my $correct_number_of_lines = 244812;

    while (@files) {
        # Code adapted from http://hell.jedicoder.net/?p=82
        my @children = ();
        my %bz_handles = ();
        for (1..8) {
            my $file = shift @files;
            my $pid = fork();
            if ($pid) {
                # Parent
                push @children, $pid;
            }
            elsif ($pid == 0) {
                # Child
                # Using bzcat function adapted from Compress::Bzip2
                # documentation
                my $number_of_lines = 0;
                my $line = "";
                # I tried $bz = bzopen($file, "rb") first but it
                # didn't work so I figured a hash might be better to
                # generate a new variable each time?
                $bz_handles{$pid} = bzopen($file, "rb")
                    or print ERRORS "Cannot open $file" . "\n";
                while ($bz_handles{$pid}->bzreadline($line) > 0) {
                    $number_of_lines++;
                }
                if ($bz_handles{$pid}->bzerror != BZ_STREAM_END) {
                    print ERRORS "Error reading from $file" . "\n";
                }
                $bz_handles{$pid}->bzclose();
                if ($number_of_lines != $correct_number_of_lines) {
                    print ERRORS $file . "\t" . $number_of_lines . "\n";
                }
                exit(0);
            }
            else {
                die "couldn't fork: $!\n";
            }
        }
        foreach my $child (@children) {
            waitpid($child, 0);
        }
    }
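For comparison, here is a minimal sketch of the same check. It makes no claim about why the posted code misreports files. It swaps in the core module IO::Uncompress::Bunzip2 in place of Compress::Bzip2, because its `read` method signals failure with a negative return value, so the check does not depend on comparing an error state against BZ_STREAM_END. The sub names `count_bz2_lines` and `check_files` are hypothetical, and handing each worker an interleaved slice of the file list (instead of batches of eight) is an assumption made for brevity. Two things in the posted loop are worth noting regardless: `fork` returns undef on failure, and `undef == 0` is true, so the `die` branch can never run; and `shift @files` returns undef once the list empties partway through a batch.

```perl
use strict;
use warnings;
use POSIX ();
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Count the lines in one bzip2 file.
# Returns (count, undef) on success or (undef, message) on failure.
sub count_bz2_lines {
    my ($file) = @_;

    # Transparent => 0 rejects input that is not bzip2 data;
    # MultiStream => 1 reads files made of concatenated streams.
    my $z = IO::Uncompress::Bunzip2->new($file,
                Transparent => 0, MultiStream => 1)
        or return (undef, "cannot open: $Bunzip2Error");

    my ($lines, $last_char) = (0, "\n");
    my ($buffer, $status);
    while (($status = $z->read($buffer)) > 0) {
        $lines += ($buffer =~ tr/\n//);      # count newlines in this chunk
        $last_char = substr($buffer, -1);
    }
    return (undef, "read error: $Bunzip2Error") if $status < 0;
    $z->close;

    $lines++ if $last_char ne "\n";          # final line without a newline
    return ($lines, undef);
}

# Check all files with $workers forked children.
# Returns one "file<TAB>problem" string per suspicious file.
sub check_files {
    my ($files, $expected, $workers) = @_;
    my @readers;

    for my $w (0 .. $workers - 1) {
        pipe(my $rd, my $wr) or die "pipe failed: $!\n";
        my $pid = fork();
        die "couldn't fork: $!\n" unless defined $pid;

        if ($pid == 0) {                     # child: handle every Nth file
            close $rd;
            for (my $i = $w; $i < @$files; $i += $workers) {
                my $file = $files->[$i];
                my ($n, $err) = count_bz2_lines($file);
                if (defined $err) {
                    print {$wr} "$file\t$err\n";
                }
                elsif ($n != $expected) {
                    print {$wr} "$file\t$n lines\n";
                }
            }
            close $wr;                       # flush the report to the parent
            POSIX::_exit(0);                 # skip END blocks and inherited buffers
        }

        close $wr;                           # parent keeps only the read end
        push @readers, [$pid, $rd];
    }

    my @problems;
    for my $r (@readers) {
        my ($pid, $rd) = @$r;
        push @problems, <$rd>;               # read until the child closes its end
        close $rd;
        waitpid($pid, 0);
    }
    chomp @problems;
    return @problems;
}
```

A call such as `print "$_\n" for check_files(\@ARGV, 244812, 8);` would report each suspicious file on the command line, with the parent being the only process that writes the report. Routing results through a pipe per child avoids having several processes share one buffered output handle.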
Re: Checking number of lines in bzipped file
by Corion (Patriarch) on Apr 25, 2011 at 16:47 UTC
by Anonymous Monk on Apr 25, 2011 at 18:31 UTC

Re: Checking number of lines in bzipped file
by ikegami (Patriarch) on Apr 25, 2011 at 16:47 UTC

Re: Checking number of lines in bzipped file
by BrowserUk (Patriarch) on Apr 25, 2011 at 17:11 UTC