clivehumby has asked for the wisdom of the Perl Monks concerning the following question:
I am processing very large text files in Perl, looking for things that happen after a recurring pattern.
I am caching the whole file in memory; the files vary from 600MB to around 4.2GB.
The process is fast and clean but fails on files over 2GB. The actual failure point is somewhere near string position 1949803025: after this point, index() keeps returning the same value. I have even tested with a start offset well beyond this, and index() still returns the 1949803025 address (which is a correct address for the pattern).
Any suggestions as to why this may happen and how it could be overcome?

    use strict;
    use warnings;

    ## set up data in memory
    my $tm=time;
    my $file= "FRED.DAT";
    my $data;
    {
        open my $fh, '<', $file or die;
        local $/ = undef;
        $data = <$fh>;
        close $fh;
    }

    ## report load statistics
    my $str="XYZ";
    my $lx=length($data);
    my $tmx=time-$tm;
    my $r=$lx/$tmx;
    print "File $file cached $lx bytes in $tmx seconds @ $r bs\n";

    ## scan mega string for patterns and do stuff
    my $nextposn=0;
    my $offset=0;
    ## experiment with offset beyond 1949803025
    ## $offset=2500000000;
    my $found=0;
    my $occ=0;
    while ($nextposn < $lx ) {
        $nextposn = index($data,$str, $offset);
        if($nextposn < 0) {goto NOMORE;}
        $found++;
        ## do stuff you need to do with the next characters
        ##
        ##
        $offset = $nextposn+1;
        ## report progress
        $occ++;
        if ($occ == 1000000) {print "$found so far $nextposn\n"; $occ=0;}
    }

    ## diagnostics
    NOMORE:
    print "Processed $found patterns, maximum position was $nextposn\n";
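
One possible way around the limit, offered as a sketch rather than a confirmed fix (the post does not state the Perl version or build, so the cause is not established here): read the file in fixed-size chunks so that index() only ever works with small offsets, and carry the last length($pattern)-1 bytes into the next chunk so a match straddling a chunk boundary is not missed. The sub name scan_in_chunks, its parameters, and the 64 MiB default chunk size are hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): scan a file for a literal
# pattern chunk by chunk. Returns the number of matches and calls
# $on_match->($absolute_position) for each one, if a callback is given.
sub scan_in_chunks {
    my ($file, $pattern, $chunk_size, $on_match) = @_;
    my $plen = length $pattern;
    die "empty pattern\n" unless $plen;
    $chunk_size = 64 * 1024 * 1024 unless $chunk_size;   # assumed default: 64 MiB

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";

    my $buf   = '';   # current window: carried-over tail plus a fresh chunk
    my $base  = 0;    # absolute file position of the first byte of $buf
    my $found = 0;

    while (1) {
        # Append the next chunk after whatever was carried over.
        my $got = read($fh, $buf, $chunk_size, length $buf);
        die "Read error on $file: $!" unless defined $got;
        last if $got == 0;

        my $offset = 0;
        while ((my $pos = index($buf, $pattern, $offset)) >= 0) {
            $found++;
            $on_match->($base + $pos) if $on_match;
            $offset = $pos + 1;   # same stepping as the original loop
        }

        # Keep the last ($plen - 1) bytes: too short to hold a whole match,
        # so nothing is counted twice, but enough to complete a split one.
        my $keep = $plen - 1;
        $keep = length $buf if $keep > length $buf;
        $base += length($buf) - $keep;
        $buf = $keep ? substr($buf, -$keep) : '';
    }
    close $fh;
    return $found;
}
```

With the names from the original code, a call might look like scan_in_chunks($file, $str, 0, sub { my $posn = shift; ... }). This sketch only reports positions; if the "do stuff" step needs the characters that follow a match, the carry-over would have to be enlarged and matches already reported in the carried part skipped. It also assumes a Perl built with large-file support so that files over 2GB can be read at all.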

Replies are listed 'Best First'.

Re: INDEX limits
by choroba (Cardinal) on Nov 11, 2015 at 13:48 UTC
by clivehumby (Initiate) on Nov 11, 2015 at 14:08 UTC

Re: INDEX limits
by hippo (Archbishop) on Nov 11, 2015 at 13:43 UTC
by clivehumby (Initiate) on Nov 11, 2015 at 13:57 UTC
by hippo (Archbishop) on Nov 11, 2015 at 14:10 UTC

Re: INDEX limits
by Anonymous Monk on Nov 11, 2015 at 15:01 UTC

Re: INDEX limits
by pme (Monsignor) on Nov 12, 2015 at 10:20 UTC

Re: INDEX limits
by Laurent_R (Canon) on Nov 11, 2015 at 15:18 UTC