INDEX limits

clivehumby has asked for the wisdom of the Perl Monks concerning the following question:

I am processing very large text files in Perl, looking for things that happen after a recursive pattern.

I am memory caching the whole file and the files vary from 600MB to around 4.2GB.

The process is fast and clean but fails on files over 2GB. The actual failure point is somewhere near string position 1949803025: after this point INDEX keeps returning the same value. I have even tested a start offset well beyond this, and INDEX still returns the 1949803025 address (which is a correct address for the pattern).

Any suggestions as to why this may happen and how it could be overcome?

use strict;
use warnings;
## set up data in memory
my $tm=time;
my $file= "FRED.DAT";
my $data;
{
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}

## report load statistics
my $str="XYZ";
my $lx=length($data);
my $tmx=time-$tm;
my $r=$lx/($tmx || 1);   ## avoid dividing by zero when the load takes under a second
print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";

## scan mega string for patterns and do stuff
my $nextposn=0;
my $offset=0;
## experiment with offset beyond 1949803025
## $offset=2500000000;
my $found=0;  my $occ=0;
while ($nextposn < $lx ) {
   $nextposn = index($data,$str, $offset);
   last if $nextposn < 0;   ## no further matches: leave the loop
   $found++;
   ## do stuff you need to do with the next characters
   ##
   ##
   $offset = $nextposn+1;
   ## report progress
   $occ++;
   if ($occ == 1000000) {print "$found so far $nextposn\n"; $occ=0;}
   }
## diagnostics  
NOMORE: 
print "Processed $found patterns, maximum position was $nextposn\n";


Replies are listed 'Best First'.
Re: INDEX limits by choroba (Cardinal) on Nov 11, 2015 at 13:48 UTC
Are you running a 32-bit perl? It works for me for longer strings:

#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $long_string = 'ab' x 3e9;
my $pos = index $long_string, 'ba', 6e9 - 4;
say $pos;
__END__
5999999997

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: INDEX limits by clivehumby (Initiate) on Nov 11, 2015 at 14:08 UTC
My first thought too. I am running Strawberry Perl v5.18.2 (64-bit version) on Windows.
Re: INDEX limits by hippo (Archbishop) on Nov 11, 2015 at 13:43 UTC
"The process is fast and clean but fails on files over 2GB"

That screams "integer limit". What's the output of `perl -v`?
Re^2: INDEX limits by clivehumby (Initiate) on Nov 11, 2015 at 13:57 UTC
Perl v5.18.2. The length command returns the correct value.
Re^3: INDEX limits by hippo (Archbishop) on Nov 11, 2015 at 14:10 UTC
That's the version of perl only. The output from `perl -v` also includes the architecture, which is why it would be pertinent here, e.g.:

This is perl 5, version 20, subversion 3 (v5.20.3) built for x86_64-linux-thread-multi
(with 15 registered patches, see perl -V for more detail)

Copyright 1987-2015, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
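The same build details can also be read from inside a script through the core Config module. The following is only a minimal sketch: the helper name `build_summary` is made up for illustration, and it reports just the version, the architecture name and the integer and pointer sizes.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: collect the build details that matter for
# large-string work. %Config holds the settings perl was compiled with.
sub build_summary {
    return (
        version  => sprintf('%vd', $^V),   # e.g. 5.18.2
        archname => $Config{archname},     # e.g. x86_64-linux-thread-multi
        ivsize   => $Config{ivsize},       # bytes in a perl integer (IV)
        ptrsize  => $Config{ptrsize},      # bytes in a pointer
    );
}

my %build = build_summary();
printf "%-8s %s\n", $_, $build{$_} for qw(version archname ivsize ptrsize);
```

An ivsize of 8 means perl's own integers are 64 bits wide. Going by the reports in this thread, that alone does not guarantee that index behaves on strings over 2GB with older perl versions.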
Re: INDEX limits by Anonymous Monk on Nov 11, 2015 at 15:01 UTC
my $D = pack '(.* A)', map { $_, "foo" } 77777, 1234000000, 2345000111;
my $p = 0;
{
    $p = index($D, "foo", $p);
    warn "p=$p\n";
    redo if ++$p;
}

Segfaults for me with perl5.20.2 and earlier. Works with 5.22.0. All are x86_64-linux-thread-multi.
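If upgrading to a perl where this works is not an option, one possible workaround (not something proposed in this thread, just a sketch) is to avoid building one giant string at all: read the file in fixed-size chunks, carry over the last length(pattern)-1 bytes so that matches spanning a chunk boundary are still seen, and add the chunk's base offset to each hit. The helper name `find_all` and the chunk size are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a reference to the list of absolute byte
# offsets at which $needle occurs in the file $path, reading at most
# $chunk_size new bytes per pass so index() only ever sees a small string.
sub find_all {
    my ($path, $needle, $chunk_size) = @_;
    die "needle must not be empty" unless length $needle;

    my $keep = length($needle) - 1;   # tail carried into the next pass
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @hits;
    my $buf  = '';
    my $base = 0;                     # file offset of the first byte in $buf
    while (1) {
        # Append the next chunk after the carried-over tail.
        my $got = read($fh, $buf, $chunk_size, length $buf);
        die "Read error on $path: $!" unless defined $got;

        my $pos = 0;
        while (($pos = index($buf, $needle, $pos)) >= 0) {
            push @hits, $base + $pos;
            $pos++;                   # step by one so overlapping matches count
        }
        last if $got == 0;            # end of file

        # Drop everything except the tail. The tail is shorter than the
        # needle, so no match can be reported twice.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $base += $drop;
        }
    }
    close $fh;
    return \@hits;
}
```

A call such as `find_all("FRED.DAT", "XYZ", 64 * 1024 * 1024)` would scan the file from the original post in 64MB chunks. If the processing needs to look at characters following each match, the carried-over tail would have to be lengthened by that many bytes, and the duplicate handling adjusted accordingly.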
Re: INDEX limits by pme (Monsignor) on Nov 12, 2015 at 10:20 UTC
Hi clivehumby,

I think there is something wrong with the I/O here. Perl scalars can hold values up to 2^53 (see maximum value of a scalar for details). For testing purposes I created two big files, F1.DAT and F2.DAT, of 1GB and 2GB respectively, and modified your code as follows:

use strict;
use warnings;

my $int2 = 2147483648;
my $int4 = 2147483648 * 2;
my $int8 = 2147483648 * 4;
my $int16 = 2147483648 * 8;
print "int2 $int2\n";
print "int4 $int4\n";
print "int8 $int8\n";
print "int16 $int16\n";

## set up data in memory
my $tm=time;
my $file= $ARGV[0];
my $data;
{
    open my $fh, '<', $file or die;
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}

## report load statistics
my $str="XYZ";
my $lx=length($data);
my $tmx=time-$tm;
my $r=$lx/($tmx || 1);   ## avoid dividing by zero on a sub-second load
print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";

I ran the script with different perl versions: 5.8.9 and 5.10.1 failed but 5.20.2 worked correctly.
Re: INDEX limits by Laurent_R (Canon) on Nov 11, 2015 at 15:18 UTC
Can you please show the output when running your program? And any error message you're getting?