Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to use Search::Dict on a large gzipped file of words. Can I do this in Perl with a module from CPAN? Specifically, I need to be able to do random-access seeks within the gzipped file. Most of the gzip modules (Tie::Gzip, IO::Uncompress::Gunzip, IO::Zlib) disclaim this functionality. However, zlib itself has a gzseek function, so I'm fairly sure this must be possible. I am not sure about PerlIO::gzip. I am running Linux.
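
For reference, this is what I do today on the uncompressed file (a sketch; words.txt stands in for my real data). Search::Dict::look binary-searches a sorted file by calling seek() and tell() on the handle, which is exactly the operation the gzip layers don't support:

    use strict;
    use warnings;
    use Search::Dict;

    # look() binary-searches a sorted file by repeatedly calling
    # seek() and tell() on the handle.
    open my $fh, '<', 'words.txt' or die "open words.txt: $!";
    look $fh, 'banana';               # position at first line ge 'banana'
    my $hit = <$fh>;
    print "nearest entry: $hit" if defined $hit;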

Re: gzseek for perl filehandles
by BrowserUk (Patriarch) on Dec 23, 2010 at 15:36 UTC

    You'd do well to consider the performance impact of random seeks within a compressed file, in light of this from the zlib docs:

    If file is open for reading, the implementation may still need to uncompress all of the data up to the new offset. As a result, gzseek() may be extremely slow in some circumstances.

    Just how large is your file of words?
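
    For what it's worth, the Compress::Zlib binding does expose that gzseek if you want to try it; just be aware that it is pure emulation, and on a read handle only forward seeks are allowed. A sketch (words.gz is a stand-in for your file):

        use strict;
        use warnings;
        use Compress::Zlib;
        use Fcntl qw(SEEK_SET);

        # "Seeking" here really means decompressing and discarding
        # everything up to the requested offset.
        my $gz = gzopen('words.gz', 'rb')
            or die "gzopen failed: $gzerrno";
        $gz->gzseek(150_000, SEEK_SET);   # reads and throws away 150,000 bytes
        $gz->gzreadline(my $line);        # the (possibly partial) line there
        print $line;
        $gz->gzclose;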


      That's what

            --rsyncable   Make rsync-friendly archive

      should solve. Although I am open to using a compressor other than gzip if it supports random access. I think .zip must support this, for example, but I don't see a compressed-filehandle library with random-seek support on CPAN.
        File size is 1GB compressed: It's not words but phrases that I want to match against.

        Googling site:zlib.net rsyncable turns up 0 hits?

Re: gzseek for perl filehandles
by ahmad (Hermit) on Dec 24, 2010 at 01:59 UTC

    I don't know exactly what you are trying to do, but I would suggest that you switch to a real database instead of flat files and just use SQL queries to get what you want out of it.

    It would be faster, and possibly more compact too.
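
    For instance, with SQLite there is no server to manage. A sketch; the file name and schema are made up:

        use strict;
        use warnings;
        use DBI;

        # One row per phrase; the PRIMARY KEY gives an indexed lookup
        # instead of a scan over a 1GB file.
        my $dbh = DBI->connect('dbi:SQLite:dbname=phrases.db', '', '',
                               { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS phrases (phrase TEXT PRIMARY KEY)');

        my $sth = $dbh->prepare('SELECT 1 FROM phrases WHERE phrase = ?');
        $sth->execute('some phrase');
        print "found\n" if $sth->fetchrow_array;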

Re: gzseek for perl filehandles
by furry_marmot (Pilgrim) on Dec 24, 2010 at 17:51 UTC

    Others have said something like this, but I thought I'd throw my two cents in and try to hit the nail on the head.

    On the face of it, since the text you seek is compressed, random-access seeking is not possible. An offset of 150K into the compressed file is meaningless in the context of the uncompressed text. Seek 150K into the compressed file, uncompress from there, and maybe you're 300K into the uncompressed text, or maybe 1.5MB. Worse, a deflate stream generally can't be decoded from an arbitrary offset at all, because back-references point into earlier data. You have to uncompress a bunch of it AND THEN do your seeks. Consider what BrowserUK quoted in the first reply to your post:

    If file is open for reading, the implementation may still need to uncompress all of the data up to the new offset. As a result, gzseek() may be extremely slow in some circumstances.

    In other words, some, or most, or all of the file must be uncompressed before you can do your random seeks -- and that's for each call to gzseek()! Performance will suck heavily.

    In your responses, you've made it clear that you're still looking for something that will do random seeks into a compressed file. So I repeat, IT'S NOT POSSIBLE. It's just a parlor trick. Whatever module you find or write will either uncompress the file a little at a time to find what you're looking for, or will uncompress the whole file and search through that.

    If you can't get away from the size of the file, you might consider rethinking your approach. Can you turn the process around? Can you uncompress the file a block at a time, and then process the phrases as you read them, rather than seek each phrase separately in the file (which it sounds like you want to do)? If you really, truly have to search the whole file for each phrase, the fastest solution is probably to uncompress it yourself (keeping the compressed version so you don't have to re-compress it), do whatever you're doing, and then delete it.
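
    For example, a single streaming pass looks something like this (a sketch; the file name and phrase list are placeholders):

        use strict;
        use warnings;
        use IO::Uncompress::Gunzip qw($GunzipError);

        # Decompress once, front to back, testing each line against the
        # phrase set; no seeking anywhere.
        my %wanted = map { $_ => 1 } ('some phrase', 'another phrase');

        my $gz = IO::Uncompress::Gunzip->new('words.gz')
            or die "gunzip failed: $GunzipError";
        while (my $line = <$gz>) {
            chomp $line;
            print "$line\n" if $wanted{$line};
        }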

    My two cents.

    --marmot
      OP here,

      I think I found some equivalent off-the-shelf solutions:

      fusecompress, fuse-zip, and compFUSEd.

      If anyone has any experience with the above they can share, it would be greatly appreciated.

      However, I still think the original idea could work with an ordered dictionary, given that the compressor's reset points can be identified; see the sketch below.
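
      Here is a rough sketch of what I mean: write the data as a series of independent gzip members (one per block) plus a side index, so a lookup only has to decompress one block. All names and sizes here are illustrative:

          use strict;
          use warnings;
          use IO::Compress::Gzip qw($GzipError);
          use IO::Uncompress::Gunzip qw($GunzipError);

          # Compress $infile as one gzip member per block, recording
          # [uncompressed offset, compressed offset] for each block.
          sub build_blocked_gzip {
              my ($infile, $outfile, $blocksize) = @_;
              $blocksize ||= 64 * 1024;
              open my $in,  '<:raw', $infile  or die "open $infile: $!";
              open my $out, '>:raw', $outfile or die "open $outfile: $!";
              my @index;
              my $uoff = 0;
              while (my $n = read($in, my $block, $blocksize)) {
                  push @index, [ $uoff, tell($out) ];
                  my $gz = IO::Compress::Gzip->new($out) or die $GzipError;
                  $gz->print($block);
                  $gz->close;               # ends this member, $out stays open
                  $uoff += $n;
              }
              return \@index;
          }

          # Fetch the block containing uncompressed offset $want: binary
          # search the index, seek() the raw handle, gunzip one member.
          sub block_at {
              my ($file, $index, $want) = @_;
              my ($lo, $hi) = (0, $#$index);
              while ($lo < $hi) {
                  my $mid = int(($lo + $hi + 1) / 2);
                  if ($index->[$mid][0] <= $want) { $lo = $mid }
                  else                            { $hi = $mid - 1 }
              }
              open my $fh, '<:raw', $file or die "open $file: $!";
              seek $fh, $index->[$lo][1], 0 or die "seek: $!";
              my $gz = IO::Uncompress::Gunzip->new($fh, MultiStream => 0)
                  or die "gunzip failed: $GunzipError";
              local $/;                     # slurp exactly one member
              return scalar <$gz>;
          }

      This is essentially what dictzip and bgzip do; the price is slightly worse compression, since every block starts with a fresh dictionary.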

        could work with an ordered dictionary

        Perhaps you should take a look at cdb and CDB_File. If size is an issue, think about compressing each record separately before stuffing it into cdb.

        Also, think about using SQLite (i.e. DBI and DBD::SQLite).
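
        A sketch of the cdb route, with each record compressed separately as suggested (file names and sample data are made up):

            use strict;
            use warnings;
            use CDB_File;
            use Compress::Zlib qw(compress uncompress);

            # Build: one key per phrase, each value deflated on its own.
            my $cdb = CDB_File->new('phrases.cdb', 'phrases.cdb.tmp')
                or die "CDB_File->new: $!";
            $cdb->insert('some phrase',    compress('payload one'));
            $cdb->insert('another phrase', compress('payload two'));
            $cdb->finish;

            # Lookup: constant time, touching only the one record.
            tie my %dict, 'CDB_File', 'phrases.cdb' or die "tie: $!";
            print uncompress($dict{'some phrase'}), "\n"
                if exists $dict{'some phrase'};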

        Alexander
