bop has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Using a 64-bit version of perl, I'm loading a (say) 3GB file into memory as a string. Once loaded, I check the length of the string, which is reported as 3GB. I then access the data in the string through the vec function. This works fine for offsets smaller than 2GB. For offsets larger than 2GB, however, I get undefined (or 0) values, which shouldn't be the case (let's say I know the file contains only non-zero values).

Is this a known issue? I couldn't find anything about it through web searches. (In case it makes a difference, since the issue may be implementation-specific: I'm using ActiveState Perl.)

Thank you, bop

Re: 2GB limit to vecs
by almut (Canon) on Jun 20, 2008 at 07:14 UTC

    The prototype of the respective function (doop.c, line 741) suggests the offset is in fact 32-bit:

    UV Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size)

    at least if the type I32 is always 32-bit, which it seems to be, as it turned out in a recent thread.

    (As it's a signed int, the largest positive value is 2**31 - 1, or 2,147,483,647.)
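
To make that limit concrete, here is a minimal sketch of what happens to a value just past the top of a signed 32-bit range. The helper name as_i32 is made up for illustration, and the snippet only demonstrates two's-complement reinterpretation via pack/unpack; it does not claim to reproduce what perl's internals do with an oversized vec offset.

```perl
use strict;
use warnings;

# Hypothetical helper: take the low 32 bits of a value and reinterpret
# them as a signed 32-bit integer ('L' = unsigned 32-bit, 'l' = signed).
sub as_i32 {
    my ($value) = @_;
    return unpack 'l', pack 'L', $value;
}

printf "%d\n", as_i32( 2**31 - 1 );   # 2147483647, the largest I32
printf "%d\n", as_i32( 2**31 );       # -2147483648, one past the limit
```

With 8-bit elements the offset is a byte index, so the largest positive I32 corresponds to the last byte below the 2GB mark, which matches the symptom described in the question.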

Re: 2GB limit to vecs
by salva (Canon) on Jun 20, 2008 at 07:32 UTC
    perl 5.10.0 source says
    Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size)
    I32 is perl's type alias for 32-bit signed integers.

    I don't know if there is any reason to use I32 for indexes instead of IVs (native ints or bigger). Vectors are not the only data structure with that limitation: the I32 type is used pervasively inside the perl code for index values. For instance, arrays and substr also use 32-bit indexes.

    The workaround is to create your own XS vector module with support for 64-bit indexes. In any case, file a bug report with perlbug, and it may be fixed for 5.12!

      or sooner (5.10.1)
        Eliminating all the I32 indexes in the interpreter would break binary compatibility, so it is very unlikely to happen on the 5.10 branch. Fixing just vec, though, might be possible.
Re: 2GB limit to vecs
by BrowserUk (Patriarch) on Jun 20, 2008 at 10:03 UTC

    Until a better version of vec is available, you might try something like this:

    sub myvec(\$$$) :lvalue {
        use constant TWO_GB => 2**31;
        my( $ref, $offset, $bits ) = @_;
        if( $offset > TWO_GB - 1 ) {
            $offset -= TWO_GB;
            $ref = \substr $$ref, ( TWO_GB * $bits ) / 8;
        }
        CORE::vec( $$ref, $offset, $bits );
    }

    Which should be reasonably efficient as it avoids copying the huge string. If you wanted to get fancy in anticipation of the fixed version, you could stick it in a module and export it as CORE::GLOBAL::vec.

    The above is untested at the transition limit as I don't have enough memory to create strings that big. You might want to look closely at that TWO_GB - 1...


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear Monks,
      thank you for all your help! I used the same workaround (working with different chunks), albeit nowhere near as elegantly as that. I will use this from now on. (I am on a 64 bit machine; I had just assumed that with a 64 bit version of perl everything would be running on 64 bits.)
      thank you,
      bop
        I am on a 64 bit machine

        If there is any chance that your strings will get bigger than 4GB, I think you just need to change the if to a while and the rest would take care of itself:

        sub myvec(\$$$) :lvalue {
            use constant TWO_GB => 2**31;
            my( $ref, $offset, $bits ) = @_;
            while( $offset > TWO_GB - 1 ) {
                $offset -= TWO_GB;
                $ref = \substr $$ref, ( TWO_GB * $bits ) / 8;
            }
            CORE::vec( $$ref, $offset, $bits );
        }

        but again that's untested, so check it and convince yourself.
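
One way to check the offset arithmetic without a multi-gigabyte string is to shrink the limit. The sketch below is not the myvec sub itself: split_offset and check_limit are hypothetical helpers, the limit of 16 elements is an arbitrary stand-in for TWO_GB, and the skipped bytes are accumulated in a counter rather than through nested substr references. It only confirms that "skip (limit * bits) / 8 bytes, subtract limit from the offset" lands on the same element as a direct vec call.

```perl
use strict;
use warnings;

# Hypothetical helper: same loop as myvec, but the limit is a parameter
# and the bytes to skip are summed instead of taking substr references.
sub split_offset {
    my ( $offset, $bits, $limit ) = @_;
    my $skip_bytes = 0;
    while ( $offset > $limit - 1 ) {
        $offset     -= $limit;
        $skip_bytes += ( $limit * $bits ) / 8;
    }
    return ( $skip_bytes, $offset );
}

# Build a small vector, then compare direct access with "skip, then vec"
# for every element. Returns 1 if all elements agree, 0 otherwise.
sub check_limit {
    my ( $bits, $limit, $elements ) = @_;
    my $mod  = $bits < 8 ? 2**$bits : 256;    # keep values within range
    my $data = '';
    vec( $data, $_, $bits ) = ( $_ * 7 + 3 ) % $mod for 0 .. $elements - 1;

    for my $i ( 0 .. $elements - 1 ) {
        my ( $skip, $rest ) = split_offset( $i, $bits, $limit );
        my $tail = substr $data, $skip;
        return 0 if vec( $tail, $rest, $bits ) != vec( $data, $i, $bits );
    }
    return 1;
}

for my $bits ( 1, 2, 4, 8, 16, 32 ) {
    printf "bits=%-2d %s\n", $bits,
        check_limit( $bits, 16, 100 ) ? 'ok' : 'MISMATCH';
}
```

This exercises the boundary at offset == limit and several wraps beyond it. It says nothing about whether substr itself accepts a start position of 2**31 or more on a given perl build, which still needs checking on the real data.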


Re: 2GB limit to vecs
by samtregar (Abbot) on Jun 22, 2008 at 18:01 UTC

      Did you? It appears to me that Bit::Vector's new() takes a number of bits via an argument of type N_int, which is defined as "unsigned int". That is likely 32 bits, which means at most 2**32 bits, or 2**32/8 bytes: 512MB, not even 2GB.

      I went to try Bit::Vector but ended up killing it before it finished allocating the (it appears) 512MB of memory.

      - tye

        Too bad it didn't work. I think at the very least you'll need to be on a 64-bit platform where an int is 64 bits. That might even let vec() work, since Perl's "32 bit" type might end up being 64 bits anyway.

        I suppose the obvious fix here is to split your data into chunks and make access a two step process - pick the right chunk and then vec() into it.
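
A rough sketch of that two-step access, under some assumptions: load_chunks and chunk_vec are made-up names, the input is a regular file (or in-memory handle) so read() only comes up short at end of file, and the chunk size in bytes is a multiple of 4 so that no element of up to 32 bits straddles a chunk boundary. The 16-byte chunk size below is only for demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: read a filehandle into fixed-size chunks.
# Assumes read() returns a full chunk except at end of file.
sub load_chunks {
    my ( $fh, $chunk_bytes ) = @_;
    binmode $fh;
    my @chunks;
    while (1) {
        my $got = read( $fh, my $buf, $chunk_bytes );
        die "read failed: $!" unless defined $got;
        last if $got == 0;
        push @chunks, $buf;
    }
    return { bytes => $chunk_bytes, chunks => \@chunks };
}

# Step 1: pick the chunk. Step 2: vec() into it with a small offset.
sub chunk_vec {
    my ( $store, $offset, $bits ) = @_;
    my $per_chunk = $store->{bytes} * 8 / $bits;    # elements per chunk
    my $which     = int( $offset / $per_chunk );
    my $chunks    = $store->{chunks};
    return 0 if $which > $#$chunks;                 # past the end, like vec
    return vec( $chunks->[$which], $offset % $per_chunk, $bits );
}

# Demonstration on an in-memory "file" of 100 non-zero bytes.
my $data = join '', map { chr( 1 + $_ % 255 ) } 0 .. 99;
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $store = load_chunks( $fh, 16 );
close $fh;

printf "chunks: %d, byte 37: %d\n",
    scalar @{ $store->{chunks} }, chunk_vec( $store, 37, 8 );
```

For real data the chunk size would be something comfortably below 2GB, so every offset passed to vec() stays well inside the 32-bit range.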

        -sam

Re: 2GB limit to vecs
by salva (Canon) on Jun 26, 2008 at 13:46 UTC