Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I need to check some 60000+ bzipped files of which I know approx. 15 are faulty to find out which ones are broken. I also know how many lines these files should have and need to check that it's correct for each file.

I've written a script for this in which I tried to adapt the bzcat example from Compress::Bzip2 and combine it with code for forking from http://hell.jedicoder.net/?p=82 (the machine I work on supports running eight processes concurrently). Unfortunately it doesn't work the way I expected it to.

Basically, it tells me that almost every file is broken ("Error reading from <file>" for each file), prints blank lines for a few of them, and reports no file with the wrong number of lines, although I know there are some.

Can someone help? I thought it might have to do with forking and variable names, maybe...

use strict;
use warnings;
use Compress::Bzip2;

my $error_file = "errors.txt";
open ERRORS, ">$error_file";

my @files = glob("$path");
my $correct_number_of_lines = 244812;

while (@files) {
    # Code adapted from http://hell.jedicoder.net/?p=82
    my @children = ();
    my %bz_handles = ();
    for(1..8) {
        my $file = shift @files;
        my $pid = fork();
        if ($pid) {
            # Parent
            push @children, $pid;
        } elsif ($pid == 0) {
            # Child
            # Using bzcat function adapted from Compress::Bzip2
            # documentation
            my $number_of_lines = 0;
            my $line = "";
            # I tried $bz = bzopen($file, "rb") first but it
            # didn't work so I figured a hash might be better to
            # generate a new variable each time?
            $bz_handles{$pid} = bzopen($file, "rb")
                or print ERRORS "Cannot open $file" . "\n";
            while ($bz_handles{$pid}->bzreadline($line) > 0) {
                $number_of_lines++;
            }
            if ($bz_handles{$pid}->bzerror != BZ_STREAM_END) {
                print ERRORS "Error reading from $file" . "\n";
            }
            $bz_handles{$pid}->bzclose() ;
            if ($number_of_lines != $correct_number_of_lines) {
                print ERRORS $file . "\t" . $number_of_lines . "\n";;
            }
            exit(0);
        } else {
            die "couldn't fork: $!\n";
        }
    }
    foreach my $child (@children) {
        waitpid($child, 0);
    }
}

Re: Checking number of lines in bzipped file
by Corion (Patriarch) on Apr 25, 2011 at 16:47 UTC

    Does your "reading" part work without the call to fork?

    Personally, I would just use a pipe from bzcat to read the file instead:

    open my $fh, "bzcat '$file' |"
        or die "Couldn't spawn [bzcat '$file']: $! / $?";
    while (<$fh>) { $number_of_lines++ };

    Once that works, look at either using Dominus' runN script or re-wrap your script with the routine doing the "fork" calls. Alternatively, you can also look at Parallel::ForkManager to rate-limit your child programs.
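A minimal sketch of the pipe approach, expanded so that it reports both the line count and whether the decompressor failed. The function name `count_lines_from` and the small driver loop are illustrative assumptions, not code from the thread. The command is passed to `open` in list form so that no shell parses the file name, and `close` on a pipe handle waits for the child and leaves its wait status in `$?`.

```perl
use strict;
use warnings;

# Count the lines a command writes to STDOUT and return its wait status.
# @cmd is passed in list form, so no shell ever sees the file name.
sub count_lines_from {
    my @cmd = @_;
    open my $fh, '-|', @cmd
        or die "Couldn't spawn [@cmd]: $!\n";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;                  # waits for the child and sets $?
    return ($lines, $?);        # $? is 0 only if the command succeeded
}

# Hypothetical driver: file names come from the command line,
# the expected line count is the one given in the question.
my $expected = 244812;
for my $file (@ARGV) {
    my ($lines, $status) = count_lines_from('bzcat', $file);
    print "$file\tbzcat failed (wait status $status)\n" if $status != 0;
    print "$file\t$lines\n"                             if $lines != $expected;
}
```

A non-zero status with a plausible line count would point at a damaged archive, while a zero status with the wrong count would point at a file that is intact but short or long. The diagnostic text that bzcat writes to STDERR is not captured here; it passes through to the terminal or log.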

      I like that :-) I was using Compress::Bzip2 because I couldn't figure out how to capture both STDOUT and STDERR from "bzcat <file> | wc -l". This gives me both :-) (And I've found other solutions by googling just now, too) The entry under "qx" in perlop says it's "easiest" to re-direct them to files and then read the files in again, and I mistakenly understood "easiest = only way" :-(

Re: Checking number of lines in bzipped file
by ikegami (Patriarch) on Apr 25, 2011 at 16:47 UTC

    I don't know about your errors, but I did notice a problem with your parallelisation: you always wait for the slowest of each batch of eight to complete before moving on to the ninth (17th, 25th, etc) file, so up to seven slots sit idle while the slowest child finishes. Parallel::ForkManager does exactly what you're doing without that bug.
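To illustrate the difference, here is a core-only sketch of a pool that starts a new child as soon as any one child exits, rather than waiting for a whole batch. It is not Parallel::ForkManager's implementation, just the same idea written with `fork` and `wait`; the name `run_pool` and its interface are assumptions made for this example. It also checks `defined $pid`, because `fork` returns `undef` on failure and `undef == 0` is true in Perl.

```perl
use strict;
use warnings;

# Run $worker->($job) in a child process for every job, keeping at most
# $max children alive at once. Returns a hash ref: job => child exit code
# (0 if the worker returned true, 1 otherwise).
sub run_pool {
    my ($max, $jobs, $worker) = @_;
    my (%job_for_pid, %exit_for_job);
    my @queue = @$jobs;

    while (@queue or %job_for_pid) {
        # Top up the pool while there is work left and a free slot.
        while (@queue and keys(%job_for_pid) < $max) {
            my $job = shift @queue;
            my $pid = fork();
            die "couldn't fork: $!\n" unless defined $pid;
            if ($pid == 0) {
                # Child: do the work and report the result via exit code.
                exit( $worker->($job) ? 0 : 1 );
            }
            $job_for_pid{$pid} = $job;
        }

        # Block until any one child finishes, then free its slot.
        my $pid = wait();
        last if $pid == -1;
        my $job = delete $job_for_pid{$pid};
        $exit_for_job{$job} = $? >> 8 if defined $job;
    }
    return \%exit_for_job;
}
```

With eight slots and a worker that checks one file, a call might look like `run_pool(8, \@files, \&check_file)`, where `check_file` is a hypothetical routine returning true for a good file.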

    I tried $bz = bzopen($file, "rb") first but it didn't work so I figured a hash might be better to generate a new variable each time?

    If you want a new variable, use my.

    There's no difference between a scalar and a hash with a single element, so it's nonsense to use a hash there.
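A small illustration of the point about `my`: a lexical declared inside a loop body is a fresh variable on every pass, so no hash is needed to keep the handles apart. In the posted script the hash adds nothing anyway, since `fork` returns 0 in the child and every child therefore uses the key `0`. The string below is only a stand-in for the object that `bzopen` would return, and `handles_for` is a name invented for this sketch.

```perl
use strict;
use warnings;

# Returns one closure per file name; each closure remembers its own $handle.
sub handles_for {
    my @getters;
    for my $file (@_) {
        my $handle = "handle for $file";   # stand-in for bzopen($file, "rb")
        push @getters, sub { $handle };    # captures this pass's $handle only
    }
    return @getters;
}

# Each closure still sees the value from its own loop iteration.
print $_->(), "\n" for handles_for(qw(a.bz2 b.bz2 c.bz2));
```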

Re: Checking number of lines in bzipped file
by BrowserUk (Patriarch) on Apr 25, 2011 at 17:11 UTC

    Here's how I would tackle the task:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use Thread::Queue;

    use constant EXPECTED_SIZE => 244812;

    my( $path, $T ) = @ARGV;

    my $Q = new Thread::Queue;
    my $sem :shared;

    my @threads = map{
        async {
            while( my $file = $Q->dequeue ) {
                my $count = `bzcat $file | wc -l`;
                if( $count != EXPECTED_SIZE ) {
                    lock $sem;
                    warn "$file: $count\n";
                }
            }
        }
    } 1 .. $T;

    $Q->enqueue( glob $path );
    $Q->enqueue( (undef) x $T );

    $_->join for @threads;

    __END__
    901213 \test\*.bz2 8 2> your.log

    Should run anywhere you have a threaded perl. The second command line argument is the number of concurrent processes to run.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.