I am processing very large text files in Perl, looking for things that happen after a recurring pattern.
I am caching the whole file in memory; the files vary from 600MB to around 4.2GB.
The process is fast and clean but fails on files over 2GB. The failure point is somewhere near string position 1949803025: after this point, index() keeps returning the same value. I have even tested a start offset well beyond this, and index() still returns the 1949803025 address (which is a correct address for the pattern).
Any suggestions as to why this may happen and how it could be overcome?
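Editor's note, not from the original post: the stall just under 2**31 suggests a 32-bit string-position limit. It is worth checking whether this perl was built with 64-bit integers; a build with 4-byte IVs cannot address string positions beyond 2**31-1, and reportedly even some older 64-bit builds used 32-bit offsets inside index(), so trying a newer perl is also worth a test. A quick diagnostic:

```perl
use strict;
use warnings;
use Config;

# Report whether this perl carries 64-bit integers. String positions
# beyond 2**31-1 need ivsize 8 (and a reasonably modern perl).
print "ivsize      = $Config{ivsize}\n";   # 8 on a 64-bit-integer build, 4 otherwise
print "use64bitint = ", (defined $Config{use64bitint} ? $Config{use64bitint} : 'undef'), "\n";
```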
use strict; use warnings;

## set up data in memory
my $tm   = time;
my $file = "FRED.DAT";
my $data;
{
    open my $fh, '<', $file or die "open $file: $!";
    local $/ = undef;
    $data = <$fh>;
    close $fh;
}

## report load statistics
my $str = "XYZ";
my $lx  = length($data);
my $tmx = time - $tm;
my $r   = $lx / $tmx;
print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";

## scan mega string for patterns and do stuff
my $nextposn = 0;
my $offset   = 0;
## experiment with offset beyond 1949803025
## $offset = 2500000000;
my $found = 0;
my $occ   = 0;
while ($nextposn < $lx) {
    $nextposn = index($data, $str, $offset);
    if ($nextposn < 0) { goto NOMORE; }
    $found++;
    ## do stuff you need to do with the next characters ##
    $offset = $nextposn + 1;
    ## report progress
    $occ++;
    if ($occ == 1000000) { print "$found so far $nextposn\n"; $occ = 0; }
}

## diagnostics
NOMORE: print "Processed $found patterns, maximum position was $nextposn\n";
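Editor's sketch of one way the limit could be worked around if rebuilding or upgrading perl is not an option: scan the file in chunks so that no single index() call ever sees an offset anywhere near 2**31. The overlap carried between chunks (pattern length minus one) guarantees a match straddling a chunk boundary is still found, and is too short to ever hold a complete match, so nothing is counted twice. The sub name, chunk size, and file/pattern names below are illustrative assumptions, not from the original post.

```perl
use strict;
use warnings;

# Count occurrences of $str in $file, reading $chunk bytes at a time.
# index() offsets stay bounded by roughly $chunk, never by file size.
sub scan_file {
    my ($file, $str, $chunk) = @_;
    my $overlap = length($str) - 1;   # a match may straddle a chunk boundary
    open my $fh, '<:raw', $file or die "open $file: $!";
    my ($base, $found, $buf) = (0, 0, '');
    while (my $got = read($fh, my $block, $chunk)) {
        $buf .= $block;
        my $pos = 0;
        while (($pos = index($buf, $str, $pos)) >= 0) {
            $found++;
            # absolute file position of this hit is $base + $pos
            $pos++;
        }
        # keep only a tail too short to hold a complete match
        my $keep = $overlap < length($buf) ? $overlap : length($buf);
        $base += length($buf) - $keep;
        $buf   = $keep ? substr($buf, -$keep) : '';
    }
    close $fh;
    return $found;
}

## e.g. scan_file("FRED.DAT", "XYZ", 1024 * 1024 * 1024);
```

Note that even on a 32-bit-IV perl, the absolute position arithmetic ($base + $pos) degrades gracefully to floating point, which is exact well past 4.2GB.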
In reply to INDEX limits by clivehumby