Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Ok, here goes: I am working on a project to parse a 3-7 MB text file that is one continuous line. On Win2k. Yargh, I know. It is evil.

I have successfully found a way to parse the file and account for incorrect entries. No problem.

I have also worked out how to apply updates to it (the humongous nastiness gets updated constantly). However, the update process takes just as long as the initial build process.

What has worked so far is to reparse the evil file, comparing each entry to the last valid (parsed) entry in the good file. This, as I am sure you are aware, is slow.

What I tried to do is build in a binary split. I compute half the size of the file in bytes and attempt to read() my next entry from that position. I get an Out of Memory message. I realise it is a retarded situation and I am making a stupid, grievous error, but please help!

Of course my file is open, the file pointer is positioned at the beginning, and $Size is the size of the file in bytes. (I am betting $Size is my problem.) $MonDay and $Year also hold valid entries.


$Target = $Size / 2;
read LIST, $NewString, $EntryLength, $Target;
$NewString = substr($NewString, $Target);
&Verify;
$CmpMonDay = substr($NewString, 16, 4);
$CmpYear = substr($NewString, 20, 4);
#This *should* split the find time in half. I hope.
#The following tree is used for a binary split.
if ($CmpYear == $Year && $MonDay > $CmpMonDay) {
    $PointerStart = $Target;
} elsif ($Year > $CmpYear) {
    $PointerStart = $Target;
}
$Start = $PointerStart;
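
For reference, a binary split over the file could be sketched roughly as below. This is only a sketch under assumptions the post does not confirm: that every entry is exactly the same length, that entries are stored in ascending date order, and that MMDD sits at offset 16 and YYYY at offset 20 of each entry (as the substr() calls above suggest). The names `record_key` and `find_first_newer` are hypothetical. The file position is moved with seek(), and each probe reads a single entry with the three-argument form of read().

```perl
use strict;
use warnings;

# Assumed layout, taken from the substr() calls in the question:
# MMDD at offset 16 and YYYY at offset 20 of each record.
sub record_key {
    my ($record) = @_;
    return substr($record, 20, 4) . substr($record, 16, 4);   # "YYYYMMDD"
}

# Hypothetical helper: return the byte offset of the first record whose key
# is greater than $last_key, or the offset just past the last whole record
# if no such record exists.
sub find_first_newer {
    my ($fh, $size, $entry_length, $last_key) = @_;
    my ($lo, $hi) = (0, int($size / $entry_length));   # record indices [lo, hi)
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        # Move the file position; SEEK_SET is 0.
        seek($fh, $mid * $entry_length, 0) or die "Cannot seek: $!";
        my $record;
        my $got = read($fh, $record, $entry_length);
        die "Short read at record $mid"
            unless defined $got && $got == $entry_length;
        if (record_key($record) le $last_key) {
            $lo = $mid + 1;    # this record is already known; look later
        } else {
            $hi = $mid;        # this record is newer; look earlier
        }
    }
    return $lo * $entry_length;
}
```

Each probe holds only one entry in memory, so the number of reads grows with the logarithm of the number of entries rather than with the file size. The filehandle should be in binmode on Windows so that byte offsets are not disturbed by CRLF translation.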

Replies are listed 'Best First'.
Re: Parsing a 4M+ Contiguous text file
by vladb (Vicar) on Dec 19, 2001 at 04:13 UTC
    Try using seek() on your filehandle to read a certain amount of data starting from the spot where you last stopped parsing the file. My understanding is that this should eliminate your "Out of Memory" problem and give you greater control.
    # $fh is your filehandle
    seek($fh, $old_position, 0);
    my $buffer = read($fh, 200000); # read ~200KB from the file
    # save current position
    $old_position = tell($fh);
    # do whatever you want with the buffer
    # here..
    cheers
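
A slightly fuller sketch of the same seek/read/tell idea follows. It assumes the caller keeps the saved byte position between runs; the helper name `read_chunks` and its callback argument are hypothetical and used only for illustration. Note that read() returns the number of bytes read and fills the buffer passed as its second argument.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file in fixed-size chunks, starting at a saved
# byte position, passing each chunk to a callback. Returns the position
# reached, so the caller can store it for the next run.
sub read_chunks {
    my ($path, $start, $chunk_size, $callback) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # no CRLF translation, so byte offsets stay exact
    seek($fh, $start, 0) or die "Cannot seek: $!";
    my $buffer;
    while (1) {
        # read() returns the byte count; the data lands in $buffer.
        my $got = read($fh, $buffer, $chunk_size);
        die "Read error: $!" unless defined $got;
        last if $got == 0;    # end of file
        $callback->($buffer);
    }
    my $position = tell($fh);
    close($fh);
    return $position;
}
```

It might be called as `my $pos = read_chunks($file, $old_position, 200_000, sub { ... });`, where the callback does the parsing. A chunk boundary can fall in the middle of an entry, so the callback would need to carry any trailing partial entry over to the next chunk.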

    "There is no system but GNU, and Linux is one of its kernels." -- Confession of Faith
      Perl's read() takes at least three parameters, a la:   read($fh, $buffer, 200000)

          -- Chip Salzenberg, Free-Floating Agent of Chaos
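
A related detail that may bear on the original code: when read() is given a fourth argument, that argument is an offset into the buffer scalar, not a position in the file. If the offset is larger than the current length of the buffer, Perl pads the buffer with "\0" bytes up to that offset before appending the data. The small sketch below illustrates this with an in-memory filehandle, which is used only to keep the example self-contained; the helper name `read_with_offset` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read $length bytes into a fresh, empty buffer using
# read()'s fourth argument, and return the byte count and the buffer.
sub read_with_offset {
    my ($fh, $length, $offset) = @_;
    my $buffer = '';
    # The fourth argument places the data at $offset within $buffer;
    # it does not move the file position.
    my $got = read($fh, $buffer, $length, $offset);
    die "Read error: $!" unless defined $got;
    return ($got, $buffer);
}

# In-memory filehandle standing in for the real file.
my $data = "abcdefghij";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my ($got, $buffer) = read_with_offset($fh, 3, 5);
printf "read %d bytes, buffer is %d bytes long\n", $got, length $buffer;
```

With an offset of half the file size, the buffer would grow by that many padding bytes on every such call, and the data would still come from the current file position rather than from the middle of the file. To read from a given place in the file, seek() there first and then use the three-argument form.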