PerlMonks  

File read stopping prematurely

by Ineffectual (Scribe)
on Aug 28, 2013 at 00:03 UTC ( [id://1051209]=perlquestion )

Ineffectual has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have a gzipped file that contains a string of two-bit binary codes. I'm attempting to read it in to do some work on the contents. However, when I use IO::Uncompress::AnyUncompress, each read ends before the end of the file. Reading through a gunzip -c | pipe works just fine, though. I've included a file snippet and code below. Is this a bug with IO::Uncompress::AnyUncompress? Is this something wrong with my code that attempts to use IO::Uncompress::AnyUncompress?

Edit: The file snippet below is where the read ends on line 899924 in each of the three files I'm processing. I don't know if those codes together add up to some sort of stop code.
    1 899682 <B6>^E
    1 899740 <B5>^E
    1 899766 <B6>^E
    1 899767 <B6>^E
    1 899816 <B5>^E<B8>^E<BD>^E
    1 899915 <B5>^E<B6>^E<B8>^E<BD>^E
    1 899924 <B5>^E<B6>^E<B8>^E<BD>^E
    1 900121 <B5>^E
    1 900159 <B6>^E
    1 900373 <B5>^E<B6>^E<B8>^E<BD>^E
    1 900686 <B5>^E<B6>^E<BD>^E
    1 900791 <B6>^E
    1 900902 <B5>^E<B6>^E<B8>^E<BD>^E
    1 900903 <B5>^E<B6>^E<B8>^E<BD>^E
    1 901004 <B8>^E
    1 901005 <B8>^E
    1 901020 <B5>^E<B6>^E<B8>^E<BD>^E
    1 901092 <B5>^E<B8>^E<BD>^E
    1 901129 <B5>^E<B6>^E<B8>^E<BD>^E
    1 901188 <B5>^E
    1 901369 <B6>^E<B8>^E<BD>^E
    1 901423 <B5>^E<BD>^E

Working code below results in line count 5042137.
    my $lineCount = 0;
    foreach my $file (@whiFiles) {
        print "reading $file\n";
        open IN, "gunzip -c $file |" or die "Can't open file $!";
        while (<IN>) {
            next if ($_ =~ /^#/);
            chomp;
            my ($chr, $pos, $codes) = split(/\t/, $_);
            $lineCount++;
        }
        close IN;
    }
    print "Line count is $lineCount\n";

Non-working code below results in line count 13263.
    my $lineCount = 0;
    foreach my $file (@whiFiles) {
        print "reading $file\n";
        my $HANDLE = new IO::Uncompress::AnyUncompress($file, Transparent => 1, AutoClose=>1) or die;
        while (<$HANDLE>) {
            next if ($_ =~ /^#/);
            chomp;
            my ($chr, $pos, $codes) = split(/\t/, $_);
            $lineCount++;
        }
        close $HANDLE;
    }
    print "Line count is $lineCount\n";

Replies are listed 'Best First'.
Re: File read stopping prematurely
by kcott (Archbishop) on Aug 28, 2013 at 04:02 UTC

    G'day Ineffectual,

    I'm not actually familiar with IO::Uncompress::AnyUncompress so the following are just some troubleshooting hints as well as some guesswork based on the documentation (in particular, the OO Interface: Constructor Options section, referenced below).

    The discrepancy between the line counts of 5,042,137 and 13,263 is obviously quite substantial. Perhaps some files are being read as a single string due to "Transparent => 1": see "... treat the whole file/buffer as a single data stream." in the docs.

    Try moving the line counting code so it's counting individual files; maybe also check the size of the data being read. Something like this rough outline:

        foreach ... {
            ...
            my $linecount = 0;
            my $datasize = 0;
            while (...) {
                $datasize += length;
                ...
                ++$linecount;
            }
            close ...
            print $linecount, $datasize;
        }

    I see you have "AutoClose => 1"; however, the docs say "This option is only valid when the $input parameter is a filehandle.". Your $input is a filename, not a filehandle, so perhaps turn that off or remove it altogether (the default is 0).

    Here are a few other things that are probably (but not certainly) unrelated to your current problem.

    • Use the 3-argument form of open with a lexical filehandle (e.g. open my $in_fh, '-|', "gunzip ...).
    • Heed the strong words of discouragement regarding use of Indirect Object Syntax in perlobj - Invoking Class Methods.
    • Add a message to your die (2nd instance).
    • Consider adding labels to your loops to clearly identify what next refers to.

    -- Ken
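
A minimal sketch of how the four side suggestions above might look when applied to the code from the question. It is not Ken's code: the sample data, the temporary file, and the sub names `count_via_gunzip` and `count_via_module` are hypothetical and exist only so the example runs on its own. It assumes a `gunzip` binary on the PATH, as in the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::AnyUncompress qw($AnyUncompressError);

# Build a small gzipped, tab-delimited sample (illustrative data only).
my $sample = "#chr\tpos\tcodes\n1\t899682\tA\n1\t899740\tB\n";
my ($tmp_fh, $tmp_name) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close $tmp_fh;
gzip \$sample => $tmp_name
    or die "gzip failed: $GzipError";

# Suggestion 1: 3-argument open with a lexical filehandle.
# The list form also keeps the filename away from the shell.
sub count_via_gunzip {
    my ($file) = @_;
    open my $in_fh, '-|', 'gunzip', '-c', $file
        or die "Can't run gunzip on '$file': $!";
    my $count = 0;
    LINE:                                   # suggestion 4: label the loop
    while (my $line = <$in_fh>) {
        next LINE if $line =~ /^#/;
        $count++;
    }
    close $in_fh
        or die "gunzip failed on '$file' (exit status $?)";
    return $count;
}

# Suggestions 2 and 3: direct method call instead of indirect object
# syntax, and a die message that says what went wrong.
sub count_via_module {
    my ($file) = @_;
    my $z = IO::Uncompress::AnyUncompress->new($file)
        or die "Can't open '$file': $AnyUncompressError";
    my $count = 0;
    LINE:
    while (my $line = <$z>) {
        next LINE if $line =~ /^#/;
        $count++;
    }
    $z->close;
    return $count;
}

my $pipe_count   = count_via_gunzip($tmp_name);
my $module_count = count_via_module($tmp_name);
print "gunzip pipe: $pipe_count, module: $module_count\n";
```

On a single, ordinary gzip file like this sample, both readers should report the same count, which gives a baseline to compare against the problem files.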

Re: File read stopping prematurely
by TJPride (Pilgrim) on Aug 28, 2013 at 04:26 UTC
    Is there some particular advantage to using a Perl library, as opposed to a command-line utility that is working for you? I would personally just use gunzip, read the file line by line if you don't want the whole thing in memory, then unlink it or gzip it again.
      This is a simplified version for troubleshooting of utility code that I use to open many different types of files. The utility code includes recognizing various file types and accessing files on local and remote servers. :)
Re: File read stopping prematurely
by pmqs (Friar) on Aug 28, 2013 at 12:09 UTC

    The first thing that occurs to me is that you are reading a binary file as if it were a text file. If any of the binary codes in the input file use the end of line characters (0x0A and 0x0D), you will end up reading the wrong thing.

    What platform are you running on? Windows does have the issue of the EOF character (ctrl-Z), but IO::Uncompress::AnyUncompress should ignore that.

    Any chance you could upload a data file, or better still the part of the file that doesn't work properly? Not sure if this site allows posting of binary files though.
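
The point about end-of-line bytes inside binary data can be illustrated with a small sketch. The record below is made up for the example (the payload bytes are an assumption, not taken from the poster's file): it is one logical record whose binary field happens to contain 0x0A.

```perl
use strict;
use warnings;

# One hypothetical record: chromosome, position, then a binary payload
# that contains the line-feed byte 0x0A in the middle.
my $record = join("\t", 1, 899924, "\xB5\x05\x0A\xBD\x05") . "\n";

# Read it back line by line from an in-memory filehandle.
open my $fh, '<:raw', \$record
    or die "Cannot open in-memory handle: $!";
my @lines = <$fh>;
close $fh;

# The embedded 0x0A splits the single record into two "lines".
printf "records written: 1, lines read: %d\n", scalar @lines;
```

A line-oriented reader has no way to tell a payload byte from a record separator, so line counts on such data reflect how many 0x0A bytes are present rather than how many records were written.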

      Interesting. I'm running on CentOS Linux. The file snippet above is the actual place where the file stops reading - it's a tab-delimited file. I copied that snippet directly from vi where it seemed to stop reading.
Re: File read stopping prematurely
by Anonymous Monk on Aug 28, 2013 at 00:29 UTC

    Is this a bug with IO::Uncompress::AnyUncompress?

    Maybe (probably not), maybe it's a problem with the any being used :) Do you have the latest version (PMQS/IO-Compress-2.062.tar.gz)?

    Maybe it's a problem with the file.

    Is this something wrong with my code that attempts to use IO::Uncompress::AnyUncompress?

    Probably not (looks like it's following the docs).

Re: File read stopping prematurely
by bitingduck (Chaplain) on Aug 28, 2013 at 01:36 UTC

    What's on lines 13262 through 13264?

    And what happens if you turn off AutoClose?

      The file snippet is where the read ends on line 899924 in each of the three files I'm processing. I don't know if those codes together add up to some sort of stop code.

      I've tried the code with both Transparent and AutoClose off and it does the same thing.
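
One possibility that nobody in the thread tests, offered here only as a hedged sketch: a `.gz` file can hold several gzip members written back to back. `gunzip -c` decompresses all of them, whereas the IO::Uncompress readers stop after the first member unless the documented `MultiStream` option is set. Whether the poster's files are multi-member is an assumption; the sample data and the `count_lines` helper below are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::AnyUncompress qw($AnyUncompressError);

# Two chunks, each compressed on its own and written back to back,
# giving one file with two gzip members (illustrative data only).
my @chunks = (
    "1\t899915\tA\n1\t899924\tB\n",
    "1\t900121\tC\n1\t900159\tD\n1\t900373\tE\n",
);
my ($out_fh, $file) = tempfile(SUFFIX => '.gz', UNLINK => 1);
binmode $out_fh;
for my $chunk (@chunks) {
    my $member;
    gzip \$chunk => \$member
        or die "gzip failed: $GzipError";
    print {$out_fh} $member;
}
close $out_fh or die "Cannot close '$file': $!";

# Hypothetical helper: count lines, passing any constructor options through.
sub count_lines {
    my ($path, %opts) = @_;
    my $z = IO::Uncompress::AnyUncompress->new($path, %opts)
        or die "Can't open '$path': $AnyUncompressError";
    my $count = 0;
    while (defined(my $line = <$z>)) {
        $count++;
    }
    $z->close;
    return $count;
}

my $default_count = count_lines($file);
my $multi_count   = count_lines($file, MultiStream => 1);
printf "default: %d lines, MultiStream => 1: %d lines\n",
    $default_count, $multi_count;
```

If the counts differ on a real file in the same way, the file is multi-member and the early stop falls at a member boundary rather than at any particular byte sequence in the data.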
