Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Question: Can I have Perl automatically buffer/store some amount of input from a gzip pipe so that I can "rewind" a small amount using tell then seek? Alternately, can I "push" a few lines back into the pipe? I'm hoping not to have to completely restructure my code to read zipped files. I have large (< 5 GB) input files, so I want to read them in one pass with only small rewinds. I need to pre-fetch some keywords in each section to properly build a data structure, and the position of the keywords can be anywhere within the section. At this size, reading the zipped version of a file is greatly preferred.

EX:

    ...
    open(LIBIN, "gunzip -c $read_file |") or die "\nERROR: ...";
    ...
    while ($line = <LIBIN>) {
        $self->{LINENUMBER}++;
        $file_line_location{$self->{LINENUMBER}} = tell LIBIN;
        .....
        # Return to the start of a section after pre-parsing
        seek LIBIN, $file_line_location{$section_reentry_line}, 0;
        ....
    }
Suggestions?

Replies are listed 'Best First'.
Re: buffering zipped pipes (Tie::Handle)
by tye (Sage) on Oct 04, 2016 at 02:38 UTC

    See Tie::Handle and the documentation that it references (mostly perltie).

    Using that you can make a handle that hides the real file handle inside of itself. Then you can have your handle's implementation of READLINE squirrel away a copy of the read line to an internal buffer. You'd implement a SEEK that would set an attribute so that the next call to READLINE would read from the hidden buffer.

    But mixing byte-offset logic (seek) with read-line logic gets to be a pain. So implementing a tied handle that takes care of all of the general cases may be more work than reworking your code a tiny bit so that you call methods, and then implementing an object that only supports the functionality you need (read next line, save this line and subsequent lines, go back to the saved line, etc.). That way you deal only in "lines".

    Alternately, you could implement a tied handle that works well enough if you only use it in the ways that your program currently does. So, for example, tell() would give you back a line number not a byte offset and trying to READ instead of READLINE would croak.
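    A minimal sketch of that line-oriented variant (the package name and details are my own illustration, not tested code): tell() hands back a line index, seek() replays from a saved index, and READ croaks. Note the buffer keeps every line read, so for multi-gigabyte files you would want to prune lines older than your furthest re-entry point.

    package LineRewindHandle;    # line-oriented: tell/seek deal in line indexes, not bytes

    use strict;
    use warnings;
    use Carp;

    sub TIEHANDLE {
        my ( $class, $fh ) = @_;
        return bless { fh => $fh, buf => [], pos => 0 }, $class;
    }

    # Scalar-context reads only in this sketch: replay from the internal
    # buffer first, then fall back to the real (piped) handle.
    sub READLINE {
        my $self = shift;
        if ( $self->{pos} < @{ $self->{buf} } ) {
            return $self->{buf}[ $self->{pos}++ ];
        }
        my $line = readline $self->{fh};
        if ( defined $line ) {
            push @{ $self->{buf} }, $line;
            $self->{pos}++;
        }
        return $line;
    }

    sub TELL { my $self = shift; return $self->{pos} }    # a line index, not a byte offset

    sub SEEK {
        my ( $self, $pos, $whence ) = @_;
        croak "only whence == 0 supported"     if $whence;
        croak "cannot seek past what was read" if $pos > @{ $self->{buf} };
        $self->{pos} = $pos;
        return 1;
    }

    sub READ  { croak "READ not supported; use readline" }
    sub CLOSE { my $self = shift; close $self->{fh} }

    package main;

    # Usage: LIBIN stays a normal gunzip pipe underneath.
    my $read_file = $ARGV[0];
    open( my $pipe, '-|', 'gunzip', '-c', $read_file ) or die "ERROR: $!";
    tie *LIBIN, 'LineRewindHandle', $pipe;

    my $mark = tell LIBIN;    # really a line index
    my $line = <LIBIN>;
    seek LIBIN, $mark, 0;     # replay: the next <LIBIN> re-reads $line from the buffer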

    - tye        

Re: buffering zipped pipes
by BrowserUk (Patriarch) on Oct 04, 2016 at 08:07 UTC

    I'd skip all the seeks and tells and use a thread and a queue this way:

    #! perl -slw
    use strict;
    use threads;
    use Thread::Queue;

    my $file = $ARGV[ 0 ];
    my $Q = new Thread::Queue;

    async {
        open LIBBIN, "gunzip -c $file |" or die $!;
        while( <LIBBIN> ) {
            $Q->enqueue( $_ );
            sleep 1 while $Q->pending > 100; ## Arbitrary limit to stop it from running away with memory
        }
        $Q->enqueue( undef ); ## close queue
    }->detach;

    ## Main code swaps $Q->dequeue for <LIBBIN> ...
    while( my $line = $Q->dequeue ) {
        if( some circumstance ) {
            unshift @{ $Q->{queue} }, $line; ## 'push' a line back to the queue.
        }
        ## do other stuff here..
    }

    The only slightly 'tricky' thing here is that I'm reaching inside the Queue object to gain access to the underlying array reference with $Q->{queue}, which the OO purists would have a hissy fit about. But to my mind, one of the big advantages of Perl's simple native OO mechanism is that I, as a user, can easily extend -- or in this case regain access to existing -- functionality without having to go cap in hand to the authors of the module or complicate things with extra OO machinery.

    If the basic mechanism of having a separate thread doing the IO and a queue to buffer the data fits with your mindset, then you should also take a look at the "ADVANCED METHODS" in the Thread::Queue docs. For the most part I consider their addition to the queue module an abomination, but for the first time ever, I can see your application might find them useful.
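    For example, with a Thread::Queue recent enough to provide the documented insert() and peek() methods, the 'push back' can stay inside the public API. A sketch of mine, mirroring the consumer loop above (the condition is still a placeholder):

        ## Same consumer loop as above, but re-queueing via the documented
        ## insert() method instead of unshifting onto $Q->{queue} directly.
        while( defined( my $line = $Q->dequeue ) ) {
            if( some_circumstance() ) {    ## placeholder condition, as in the sketch above
                $Q->insert( 0, $line );    ## put the line back at the head of the queue
            }
            ## do other stuff here..
        }

    $Q->peek also lets you look at the next queued line without removing it, which may cover some of the 'pre-fetch the keywords' cases.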


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: buffering zipped pipes
by RonW (Parson) on Oct 04, 2016 at 18:36 UTC
    I need to pre-fetch some keywords in each section to properly build a data structure. The position of the keywords can be anywhere within the section.

    Why not read a whole section into a buffer, scan the buffer for the keywords, then process the whole section?
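    A minimal sketch of that approach (my illustration; the section delimiter, keyword pattern and process_section() handler are hypothetical placeholders):

    use strict;
    use warnings;

    my $read_file = $ARGV[0];
    open( my $in, '-|', 'gunzip', '-c', $read_file ) or die "ERROR: $!";

    my @section;
    while ( my $line = <$in> ) {
        if ( $line =~ /^END_OF_SECTION/ ) {                    # hypothetical section delimiter
            my @keywords = grep { /^\s*KEYWORD:/ } @section;   # hypothetical keyword pattern
            process_section( \@section, \@keywords );          # hypothetical handler
            @section = ();
            # (a trailing, undelimited section would need the same handling after the loop)
        }
        else {
            push @section, $line;
        }
    }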

Re: buffering zipped pipes
by Anonymous Monk on Oct 04, 2016 at 14:46 UTC
Re: buffering zipped pipes
by Anonymous Monk on Oct 04, 2016 at 22:51 UTC

    OK, I thought the suggestion of using IO::Uncompress was great. BUT -- sad face ;( -- the message returned below tells me that seeking backwards is not an option with the module:

    IO::Uncompress::Gunzip::seek: cannot seek backwards at ...
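    That matches the module's documented behaviour: IO::Uncompress::Gunzip only emulates forward seeks (by reading and discarding data), so seeking back to an earlier tell() position is fatal. A minimal sketch of the failing pattern, for illustration:

    use IO::Uncompress::Gunzip qw($GunzipError);

    my $z = IO::Uncompress::Gunzip->new( $read_file )
        or die "ERROR: gunzip failed: $GunzipError\n";

    my $mark = $z->tell();     # 0, before any data has been read
    my $line = $z->getline();
    $z->seek( $mark, 0 );      # fatal: "...seek: cannot seek backwards at ..."

    So with this module you would still need to buffer already-read lines yourself (as in the suggestions above) rather than relying on seek.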