I am processing very large text files in Perl, looking for things that happen after a recurring pattern.

I am reading each file into memory in full; the files vary from 600MB to around 4.2GB.

The process is fast and clean but fails on files over 2GB. The actual failure point is somewhere near string position 1949803025; after this point index() keeps returning the same value. I have even tested a start offset well beyond this, and index() still returns 1949803025 (which is a correct address for the pattern).

Any suggestions as to why this may happen and how it could be overcome?

use strict;
use warnings;

## set up data in memory
my $tm   = time;
my $file = "FRED.DAT";
my $data;
{
    open my $fh, '<', $file or die;
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}

## report load statistics
my $str = "XYZ";
my $lx  = length($data);
my $tmx = time - $tm;
my $r   = $lx / $tmx;
print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";

## scan mega string for patterns and do stuff
my $nextposn = 0;
my $offset   = 0;
## experiment with offset beyond 1949803025
## $offset=2500000000;
my $found = 0;
my $occ   = 0;
while ($nextposn < $lx) {
    $nextposn = index($data, $str, $offset);
    if ($nextposn < 0) { goto NOMORE; }
    $found++;
    ## do stuff you need to do with the next characters
    ##
    ##
    $offset = $nextposn + 1;
    ## report progress
    $occ++;
    if ($occ == 1000000) { print "$found so far $nextposn\n"; $occ = 0; }
}

## diagnostics
NOMORE:
print "Processed $found patterns, maximum position was $nextposn\n";
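One possible workaround, sketched under an assumption that is not confirmed here: that the limit comes from how this perl build handles offsets into a single very long string. If so, scanning the file in fixed-size chunks keeps every index() offset small, and the absolute position is tracked separately. The name scan_in_chunks, the chunk size and the callback are hypothetical and only for illustration. Each chunk carries over the last length(pattern)-1 bytes of the previous one so that a match straddling a chunk boundary is still found exactly once.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): find every
# occurrence of $pattern in $path, reading $chunk_size bytes at a time.
# Calls $on_match->($absolute_position) per hit; returns the hit count.
sub scan_in_chunks {
    my ($path, $pattern, $chunk_size, $on_match) = @_;
    my $plen = length $pattern;
    die "empty pattern" unless $plen;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my $buf   = '';   # carried-over tail plus the current chunk
    my $base  = 0;    # absolute file offset of the first byte of $buf
    my $found = 0;

    while (1) {
        my $got = read($fh, my $chunk, $chunk_size);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;            # end of file
        $buf .= $chunk;

        my $pos = 0;
        while (($pos = index($buf, $pattern, $pos)) >= 0) {
            $found++;
            $on_match->($base + $pos) if $on_match;
            $pos++;   # advance by one, like $offset = $nextposn + 1 in the original loop
        }

        # Keep only the last $plen-1 bytes: enough to complete a match that
        # straddles the boundary, too short to contain a whole match again.
        my $keep = $plen - 1;
        $keep = length $buf if $keep > length $buf;
        $base += length($buf) - $keep;
        $buf = $keep ? substr($buf, -$keep) : '';
    }
    close $fh;
    return $found;
}
```

It could be called as scan_in_chunks("FRED.DAT", "XYZ", 64 * 1024 * 1024, sub { my ($abs) = @_; ... }), with the 64MB chunk size being an arbitrary choice. If the "stuff" done after each match needs the characters that follow it, the carried-over tail would have to be lengthened accordingly.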

In reply to INDEX limits by clivehumby
