clivehumby has asked for the wisdom of the Perl Monks concerning the following question:

I am processing very large text files in Perl, looking for things that happen after a recurring pattern.

I read each file into memory in full; the files vary from 600MB to around 4.2GB.

The process is fast and clean but fails on files over 2GB; the actual failure point is somewhere near string position 1949803025. After this point index returns the same value every time. I have even tested a start offset well beyond this, and index still returns 1949803025 (which is a correct position for the pattern).

Any suggestions as to why this happens and how it could be overcome?

use strict;
use warnings;

## set up data in memory
my $tm=time;
my $file= "FRED.DAT";
my $data;
{
    open my $fh, '<', $file or die;
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}

## report load statistics
my $str="XYZ";
my $lx=length($data);
my $tmx=time-$tm;
my $r=$lx/$tmx;
print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";

## scan mega string for patterns and do stuff
my $nextposn=0;
my $offset=0;
## experiment with offset beyond 1949803025
## $offset=2500000000;
my $found=0;
my $occ=0;
while ($nextposn < $lx ) {
    $nextposn = index($data,$str, $offset);
    if($nextposn < 0) {goto NOMORE;}
    $found++;
    ## do stuff you need to do with the next characters
    ##
    ##
    $offset = $nextposn+1;
    ## report progress
    $occ++;
    if ($occ == 1000000) {print "$found so far $nextposn\n"; $occ=0;}
}
## diagnostics
NOMORE:
print "Processed $found patterns, maximum position was $nextposn\n";

Replies are listed 'Best First'.
Re: INDEX limits
by choroba (Cardinal) on Nov 11, 2015 at 13:48 UTC
    Are you running a 32-bit perl?

    It works for me for longer strings:

    #! /usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    my $long_string = 'ab' x 3e9;
    my $pos = index $long_string, 'ba', 6e9 - 4;
    say $pos;
    __END__
    5999999997
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      My first thoughts too

      Running Strawberry Perl on Windows V5.18.2 64-bit version

Re: INDEX limits
by hippo (Archbishop) on Nov 11, 2015 at 13:43 UTC
    The process is fast and clean but fails on files over 2GB

    That screams "integer limit". What's the output of perl -v ?

      Perl v5.18.2. The length function returns the correct value.

        That's only the version of perl. The output from perl -v also includes the architecture, which is why it would be pertinent here, e.g.:

        This is perl 5, version 20, subversion 3 (v5.20.3) built for x86_64-linux-thread-multi
        (with 15 registered patches, see perl -V for more detail)

        Copyright 1987-2015, Larry Wall

        Perl may be copied only under the terms of either the Artistic License or the
        GNU General Public License, which may be found in the Perl 5 source kit.

        Complete documentation for Perl, including FAQ lists, should be found on
        this system using "man perl" or "perldoc perl". If you have access to the
        Internet, point your browser at http://www.perl.org/, the Perl Home Page.
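The same build details can also be read from inside a script through the core Config module. This is only a minimal sketch: the helper name build_summary is made up for illustration, and a 64-bit ivsize does not by itself prove that index copes with offsets past 2GB on a given perl version, as the reports elsewhere in this thread suggest.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: summarise the build properties that matter
# when string offsets may exceed 2**31.
sub build_summary {
    return sprintf "perl %vd, arch %s, ivsize %d bytes, ptrsize %d bytes",
        $^V, $Config{archname}, $Config{ivsize}, $Config{ptrsize};
}

print build_summary(), "\n";
```

An ivsize of 8 indicates 64-bit integers; an ivsize of 4 would point to a 32-bit integer build.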
Re: INDEX limits
by Anonymous Monk on Nov 11, 2015 at 15:01 UTC

    my $D = pack '(.* A*)*', map { $_, "foo" } 77777, 1234000000, 2345000111 ;
    my $p = 0;
    {
        $p = index($D, "foo", $p);
        warn "p=$p\n";
        redo if ++$p;
    }
    Segfaults for me with perl5.20.2 and earlier. Works with 5.22.0. All are x86_64-linux-thread-multi.
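If upgrading perl is not an option, one possible workaround is to avoid building a multi-gigabyte string at all and scan the file in chunks. The following is a sketch under assumptions, not a tested fix for the original problem: the sub name find_all_in_file and the 64MB default chunk size are invented for illustration, and it assumes the "do stuff" step can work from file offsets rather than from the whole string. A tail of length($needle) - 1 bytes is carried over between reads so that a match straddling a chunk boundary is still found.

```perl
use strict;
use warnings;

# Hypothetical helper: return the file offset of every occurrence of $needle,
# holding only one chunk (plus a short carried-over tail) in memory at a time.
sub find_all_in_file {
    my ($path, $needle, $chunk_size) = @_;
    die "needle must not be empty" unless length $needle;
    $chunk_size ||= 64 * 1024 * 1024;      # assumed default: 64MB per read
    my $keep = length($needle) - 1;        # bytes a match could straddle

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @hits;
    my $buf  = '';
    my $base = 0;                          # file offset of the first byte in $buf
    while (1) {
        # Append the next chunk after whatever tail was carried over.
        my $got = read($fh, $buf, $chunk_size, length $buf);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;

        my $pos = 0;
        while (($pos = index($buf, $needle, $pos)) >= 0) {
            push @hits, $base + $pos;
            $pos++;                        # step by one, like the original loop
        }

        # Drop everything except the tail that could begin a match
        # completed by the next chunk.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            $base += $drop;
            substr($buf, 0, $drop, '');
        }
    }
    close $fh;
    return @hits;
}
```

Called as find_all_in_file("FRED.DAT", "XYZ"), it would return the list of match positions. Each call to index then only ever sees a buffer of roughly the chunk size, well below the 2GB region where the failures in this thread were reported.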

Re: INDEX limits
by pme (Monsignor) on Nov 12, 2015 at 10:20 UTC
    Hi clivehumby,

    I think there is something wrong with the I/O here. Perl scalars can hold values up to 2^53 (see maximum value of a scalar for details). For testing purposes I created two big files, F1.DAT and F2.DAT, whose sizes are 1GB and 2GB respectively, and modified your code as follows:

    use strict;
    use warnings;

    my $int2 = 2147483648;
    my $int4 = 2147483648 * 2;
    my $int8 = 2147483648 * 4;
    my $int16 = 2147483648 * 8;
    print "int2 $int2\n";
    print "int4 $int4\n";
    print "int8 $int8\n";
    print "int16 $int16\n";

    ## set up data in memory
    my $tm=time;
    my $file= $ARGV[0];
    my $data;
    {
        open my $fh, '<', $file or die;
        local $/ = undef;
        $data = <$fh>;
        close $fh;
    }

    ## report load statistics
    my $str="XYZ";
    my $lx=length($data);
    my $tmx=time-$tm;
    my $r=$lx/$tmx;
    print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";
    I ran the script with different perl versions and the results can be found below. 5.8.9 and 5.10.1 failed but 5.20.2 worked correctly.
Re: INDEX limits
by Laurent_R (Canon) on Nov 11, 2015 at 15:18 UTC
    Can you please show the output when running your program? And any error message you're getting?