
I am processing very large text files in Perl, looking for things that happen after a recurring pattern.

I read each file into memory as a single string; the files vary from 600MB to around 4.2GB.

The process is fast and clean but fails on files over 2GB. The actual failure point is somewhere near string position 1949803025: after this point, index() keeps returning the same value. I have even tested a start offset well beyond this, and index() still returns 1949803025 (which is a correct position for the pattern).
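The last good position is just under 2**31 (2147483648), which makes me suspect a signed 32-bit offset somewhere, though that is only a guess. A minimal sketch for reporting how a given perl was built, using the core Config module; the helper name build_widths is just something made up for this post:

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: return the integer and pointer widths (in bytes)
# that this perl binary was compiled with.
sub build_widths {
    return (
        ivsize  => $Config{ivsize},    # size of a Perl integer (IV)
        ptrsize => $Config{ptrsize},   # size of a pointer
    );
}

my %w = build_widths();
printf "perl %vd: ivsize=%d ptrsize=%d\n", $^V, $w{ivsize}, $w{ptrsize};

# For comparison with the position where index() stops advancing.
print "2**31 = ", 2**31, ", last good position = 1949803025\n";
```

If ivsize or ptrsize comes back as 4, the build is a 32-bit one and the 2GB ceiling would not be surprising; if both are 8, the limit may lie in how this particular perl version handles offsets inside index() itself.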

Any suggestions as to why this happens and how it could be overcome?

use strict;
use warnings;
## set up data in memory
my $tm=time;
my $file= "FRED.DAT";
my $data;
{
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}

## report load statistics
my $str="XYZ";
my $lx=length($data);
my $tmx=time-$tm;
my $r=$tmx ? $lx/$tmx : $lx;   ## avoid dividing by zero if the load took under a second
print "File $file cached $lx bytes in $tmx seconds @ $r bytes/s\n";

## scan mega string for patterns and do stuff
my $nextposn=0;
my $offset=0;
## experiment with offset beyond 1949803025
## $offset=2500000000;
my $found=0;  my $occ=0;
my $lastposn=-1;   ## position of the most recent match
while (1) {
   $nextposn = index($data,$str, $offset);
   last if $nextposn < 0;   ## no more occurrences
   $found++;
   $lastposn = $nextposn;
   ## do stuff you need to do with the next characters
   ##
   ##
   $offset = $nextposn+1;
   ## report progress
   $occ++;
   if ($occ == 1000000) {print "$found so far $nextposn\n"; $occ=0;}
   }
## diagnostics
print "Processed $found patterns, maximum position was $lastposn\n";
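
One workaround I am considering, in case the limit is on offsets within a single huge string, is to scan the file in fixed-size chunks so that index() only ever sees small offsets. This is a rough sketch rather than a tested solution: scan_in_chunks, its parameters, and the 64MB chunk size are all my own invention, and it assumes the perl in use can read files over 2GB at all (large file support).

```perl
use strict;
use warnings;

# Hypothetical helper: scan $file for $needle in chunks of $chunk_size bytes
# and call $on_match->($absolute_position) for each occurrence.
# Returns the number of occurrences found.
sub scan_in_chunks {
    my ($file, $needle, $chunk_size, $on_match) = @_;
    die "Empty search string" unless length $needle;
    my $keep = length($needle) - 1;   # tail carried over so matches spanning chunks are seen

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $buf   = '';
    my $base  = 0;                    # absolute file offset of the first byte in $buf
    my $found = 0;

    # Append each new chunk after the tail kept from the previous one.
    while (my $got = read($fh, $buf, $chunk_size, length $buf)) {
        my $offset = 0;
        while ((my $posn = index($buf, $needle, $offset)) >= 0) {
            $on_match->($base + $posn);
            $found++;
            $offset = $posn + 1;
        }
        # Drop everything except the last $keep bytes; a complete match cannot
        # fit inside that tail, so nothing is reported twice.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $base += $drop;
        }
    }
    close $fh;
    return $found;
}

# Example run, only when a file name is given on the command line.
if (@ARGV) {
    my $last  = -1;
    my $count = scan_in_chunks($ARGV[0], "XYZ", 64 * 1024 * 1024, sub { $last = $_[0] });
    print "Processed $count patterns, maximum position was $last\n";
}
```

The catch is the "do stuff with the next characters" step: the characters after a match may not be in the buffer yet, so the callback would need its own way of reading ahead, for example by seeking in a second file handle.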

In reply to INDEX limits by clivehumby
