Parsing a 4M+ Contiguous text file

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

OK, here goes. I am working on a project to parse a 3-7 MB text file that consists of one continuous line. On Win2k. Yargh, I know. It is evil.

I have successfully found a way to parse the file and to account for incorrect entries. No problem.

I have also worked out how to apply updates to it (the humongous nastiness gets updated constantly). However, the update process takes just as long as the initial build process.

What has worked so far is to reparse the evil file, comparing each entry to the last valid (parsed) entry in the good file. This, as I am sure you are aware, is lengthy.

What I tried to do is build in a binary split. I compute half the size of the file in bytes and attempt to read() my next entry from that position. I get an Out of Memory message. I realise it is a silly situation and I am making a stupid, grievous error, but please help!

Of course my file is open, the file pointer is positioned at the beginning, and $Size is the size of the file in bytes. (I am betting $Size is my problem.) $MonDay and $Year also have valid entries.

$Target = $Size / 2;
read LIST, $NewString, $EntryLength, $Target;
$NewString = substr($NewString, $Target);
&Verify;
$CmpMonDay = substr($NewString, 16, 4);
$CmpYear = substr($NewString, 20, 4);

#This *should* split the find time in half. I hope.
#The following tree is used for the binary split.

if ($CmpYear == $Year && $MonDay > $CmpMonDay) {
    $PointerStart = $Target;
}
elsif ($Year > $CmpYear) {
    $PointerStart = $Target;
}
    
$Start = $PointerStart;
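
One possible contributor to the trouble above: the fourth argument to read() is an offset into the buffer scalar, not a position in the file. So `read LIST, $NewString, $EntryLength, $Target` pads $NewString out to $Target bytes and still reads from the current file position; moving within the file is the job of seek(). Below is a minimal sketch of a seek-based binary split. It assumes fixed-length entries sorted by date, with MMDD at offset 16 and YYYY at offset 20 as in the substr() calls above; the names key_at and first_after are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sortable key "YYYYMMDD" of entry number $n.
# Assumes fixed-length entries with MMDD at offset 16 and YYYY at offset 20.
sub key_at {
    my ($fh, $n, $entry_length) = @_;
    seek($fh, $n * $entry_length, 0) or die "seek failed: $!";
    my $got = read($fh, my $entry, $entry_length);
    die "short read at entry $n" unless defined $got && $got == $entry_length;
    return substr($entry, 20, 4) . substr($entry, 16, 4);
}

# Return the index of the first entry whose key is greater than $wanted
# (the entry count if there is none). Assumes entries are sorted by date.
sub first_after {
    my ($fh, $entry_length, $wanted) = @_;
    my ($lo, $hi) = (0, int((-s $fh) / $entry_length));
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if (key_at($fh, $mid, $entry_length) le $wanted) {
            $lo = $mid + 1;    # entries up to $mid are already known
        }
        else {
            $hi = $mid;        # the answer is at $mid or earlier
        }
    }
    return $lo;
}
```

With the handle opened and put into binmode, something like `first_after(\*LIST, $EntryLength, $Year . $MonDay)` would give the entry number at which an update could start, so only about log2(N) entries are read instead of the whole file.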

Re: Parsing a 4M+ Contiguous text file
by vladb (Vicar) on Dec 19, 2001 at 04:13 UTC

# $fh is your file handle
seek($fh, $old_position, 0);
my $buffer = read($fh, 200000); # read ~200KB from the file
# save current position
$old_position = tell($fh);

# do whatever you want with the buffer
# here..

"There is no system but GNU, and Linux is one of its kernels." -- Confession of Faith


Re: Re: Parsing a 4M+ Contiguous text file
by chip (Curate) on Dec 19, 2001 at 12:23 UTC

read() takes the buffer as its second argument, so the call should be:

read($fh, $buffer, 200000)
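
For completeness, here is a hedged sketch of the chunked approach from the parent post with the three-argument read(). The helper name read_chunks and its callback interface are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback on each chunk of $path, starting at
# byte $start, and return the file position where reading stopped.
sub read_chunks {
    my ($path, $start, $chunk_size, $callback) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode $fh;                      # byte-exact offsets, also on Windows
    seek($fh, $start, 0) or die "Cannot seek: $!";
    while (1) {
        # read() returns the number of bytes read, 0 at end of file,
        # and undef on error; the data itself lands in $buffer.
        my $got = read($fh, my $buffer, $chunk_size);
        die "Read error: $!" unless defined $got;
        last if $got == 0;
        $callback->($buffer);
    }
    my $position = tell($fh);         # save this for the next update run
    close $fh;
    return $position;
}
```

A call such as `$old_position = read_chunks($file, $old_position, 200000, sub { ... })` would then process only the data appended since the last run, assuming the file only ever grows at the end.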

-- Chip Salzenberg, Free-Floating Agent of Chaos
