Dear Monks,

I need to check 60,000+ bzipped files, of which I know approximately 15 are faulty, to find out which ones are broken. I also know how many lines each file should have and need to verify that count for every file.

I've written a script for this in which I tried to adapt the bzcat example from Compress::Bzip2 and combine it with code for forking from http://hell.jedicoder.net/?p=82 (the machine I work on supports running eight processes concurrently). Unfortunately it doesn't work the way I expected it to.

Basically, it tells me that almost every file is broken ("Error reading from <file>" for each file), prints blank lines for a few of them, and reports no file with the wrong number of lines, although I know there are some.

Can someone help? I thought it might have to do with forking and variable names, maybe...

use strict;
use warnings;
use Compress::Bzip2;

my $error_file = "errors.txt";
open ERRORS, ">$error_file";

my @files = glob("$path");
my $correct_number_of_lines = 244812;

while (@files) {
    # Code adapted from http://hell.jedicoder.net/?p=82
    my @children = ();
    my %bz_handles = ();

    for (1..8) {
        my $file = shift @files;
        my $pid = fork();
        if ($pid) { # Parent
            push @children, $pid;
        }
        elsif ($pid == 0) { # Child
            # Using bzcat function adapted from Compress::Bzip2
            # documentation
            my $number_of_lines = 0;
            my $line = "";

            # I tried $bz = bzopen($file, "rb") first but it
            # didn't work so I figured a hash might be better to
            # generate a new variable each time?
            $bz_handles{$pid} = bzopen($file, "rb")
                or print ERRORS "Cannot open $file" . "\n";
            while ($bz_handles{$pid}->bzreadline($line) > 0) {
                $number_of_lines++;
            }
            if ($bz_handles{$pid}->bzerror != BZ_STREAM_END) {
                print ERRORS "Error reading from $file" . "\n";
            }
            $bz_handles{$pid}->bzclose() ;
            if ($number_of_lines != $correct_number_of_lines) {
                print ERRORS $file . "\t" . $number_of_lines . "\n";;
            }
            exit(0);
        }
        else {
            die "couldn't fork: $!\n";
        }
    }
    foreach my $child (@children) {
        waitpid($child, 0);
    }
}
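For comparison, here is a minimal sketch of one way the same check could be structured. It is not a diagnosis of the script above. It swaps in the core module IO::Uncompress::Bunzip2 for Compress::Bzip2, and it counts newline characters, which only equals the line count if every file ends with a newline. The helper names count_bz2_lines and check_files are hypothetical. Each child reports through a pipe, so only the parent collects results. The forking form of open used here assumes a Unix-like system.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Hypothetical helper: returns (line_count, undef) on success
# or (undef, message) if the file cannot be opened or decompressed.
sub count_bz2_lines {
    my ($file) = @_;

    # Transparent => 0 makes the module reject input that is not bzip2
    # data instead of passing it through unchanged.
    my $bz = IO::Uncompress::Bunzip2->new($file, Transparent => 0)
        or return (undef, "cannot open: $Bunzip2Error");

    my ($lines, $buffer, $status) = (0, '');
    # read() returns the number of bytes uncompressed, 0 at end of file,
    # and a negative number on error.
    while (($status = $bz->read($buffer)) > 0) {
        $lines += ($buffer =~ tr/\n//);    # count newline characters
    }
    my $message = $status < 0 ? "read error: $Bunzip2Error" : undef;
    $bz->close();

    return (undef, $message) if defined $message;
    return ($lines, undef);
}

# Hypothetical helper: checks files in batches of $workers child processes
# and returns one report line per problem file, in input order.
sub check_files {
    my ($expected, $workers, @files) = @_;
    my @problems;

    while (my @batch = splice(@files, 0, $workers)) {
        my @readers;
        for my $file (@batch) {
            # open with '-|' forks; the child's STDOUT becomes a pipe
            # that the parent reads from.
            my $pid = open(my $reader, '-|');
            die "couldn't fork: $!\n" unless defined $pid;

            if ($pid == 0) {    # child
                my ($lines, $error) = count_bz2_lines($file);
                if (defined $error) {
                    print "$file\t$error\n";
                }
                elsif ($lines != $expected) {
                    print "$file\t$lines lines\n";
                }
                exit 0;
            }
            push @readers, $reader;    # parent
        }
        for my $reader (@readers) {
            push @problems, <$reader>;
            close $reader;             # also waits for that child
        }
    }
    return @problems;
}

# Usage (assumed invocation): perl check.pl 244812 /some/dir/*.bz2
if (@ARGV) {
    my ($expected, @files) = @ARGV;
    print check_files($expected, 8, @files);
}
```

Because each child writes at most one short line, the parent can read the pipes one after another without any child blocking on a full pipe. Running a whole batch before starting the next mirrors the structure of the original script; a pool that starts a new child as soon as one finishes would keep all eight slots busy but needs more bookkeeping.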

In reply to Checking number of lines in bzipped file by Anonymous Monk
