Grundle has asked for the wisdom of the Perl Monks concerning the following question:

I have written a parsing module that should be able to handle very large files containing database dumps, etc. Some files can reach sizes of 1 or 2 GB, so it is important to parse the file in parts instead of trying to load it all into memory.

After running the program, it appears that it is getting stuck in an infinite loop and restarting around record 384,000. Either my program is restarting and parsing the same file again (which I doubt, since I delete the file on completion), or it is somehow choking and restarting the parse from the beginning of the file.

The following are the pertinent pieces of code. Am I going about this parse in the wrong way?

The following is some of the module code:
my @articles  = ();
my @titles    = ();
my $reader    = new IO::Handle;
my $done_flag = 0;
my $total     = 0;

sub new {
    my $class = shift;
    my $self  = bless {};
    return $self;
}

sub parseFile {
    my ($self, $file) = @_;
    if (open(READ, $file)) {
        if (!$reader->fdopen(fileno(READ), "r")) {
            die "Cannot open file [$file] for reading\n";
        }
    } else {
        die "Cannot open file [$file] for reading\n";
    }
    loadContents();
}

sub loadContents {
    my $count = 0;
    while ((my $line = $reader->getline) && $count < 3000) {
        if (($line =~ /^\d/) && !($line =~ /\Q#REDIRECT\E/)) {
            # new valid record
            my @data = split /\s+-separator-\s+/, $line;
            push @articles, $data[2];
            push @titles,   $data[1];
            $count++;
            $total++;
        }
    }
    if ($count < 3000) {
        $done_flag = 1;
    }
    #close(READ);
}

sub closeParser {
    $reader->close;
}

sub getArticle {
    if (scalar(@articles) > 0) {
        my $article = pop @articles;
        my $title   = pop @titles;
        return parseData($article, $title);
    } else {
        return ();
    }
}

sub hasArticles {
    if (scalar(@articles) == 0 && $done_flag) {
        return 0;
    } elsif (scalar(@articles) == 0 && !$done_flag) {
        loadContents();
        return 1;
    } else {
        return 1;
    }
}
From my main program, the following is used to call the parser and run through the entire file:
my $parser = new Parser();
$parser->parseFile("$feed_location$file");
while ($parser->hasArticles()) {
    my ($url, $author, $source, $date, $title, $body) = $parser->getArticle();
    # database logic taking place here.
}
$parser->closeParser();
unlink("$feed_location$file");
Thanks for any help!

Re: Parsing large files
by tilly (Archbishop) on Apr 10, 2005 at 19:23 UTC
    My best guess is that the spot where it is choking is almost exactly 2 GB into the file, and your version of Perl is not compiled with support for large files. Therefore, when it passes the 2 GB point, the next seek takes it back to position 0 in your file. To see whether your version supports large files, run perl -V and look for -Duselargefiles.
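    For example, one quick way to check (just a sketch; the exact output depends on how your perl was built) is to ask the Config module directly:

        # Prints 'define' if this perl was built with large file support
        perl -MConfig -le 'print $Config{uselargefiles} || "not defined"'

        # 8 here means 64-bit file offsets; 4 means you are stuck at 2 GB
        perl -MConfig -le 'print $Config{lseeksize}'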

    If that is the problem, then I know of two solutions. One is to compile Perl with support for large files. The other is to change your open line to something like:

    if(open(READ, "cat $file |")){
    (I'm assuming that you are on an OS with cat installed, and that cat has support for large files.)

    If you don't replace Perl, then you'll need a similar trick to write large files, because Perl will again get confused around the 2 GB mark.
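    The same idea works for output: let cat do the actual writing so Perl itself never has to track a file offset past 2 GB. A rough, untested sketch (the filename and $record are just placeholders):

        # Pipe output through cat; cat, not perl, maintains the file offset
        open(WRITE, "| cat >> big_output.txt")
            or die "Cannot open pipe to cat: $!\n";
        print WRITE $record;
        close(WRITE);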

      That is sort of what I suspected, but I am wondering if it may be even more directly related to IO::Handle. I am not sure if IO::Handle has any size limitations or if it just works on the filehandle without having to worry about space etc.

      I like your suggestion about changing the open() statement, but before I settle on that I want to be absolutely sure that is what is going on. This process takes about 22 hours to complete, so obviously if I am wrong, it will be a costly (time-wise) mistake.

      Thanks!!
        IO::Handle just works on the filehandle.

        If performance is an issue, though, note that there is some overhead (or at least there was at one point; it may have improved since I last checked) in using IO::Handle's OO support, so it may be faster to use <> directly.
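        Roughly, the difference looks like this (the handle names here are just placeholders):

            # OO interface: one method call per line read
            while (defined(my $line = $io->getline)) {
                # process $line
            }

            # Plain <> on the underlying filehandle: less per-line overhead
            while (defined(my $line = <READ>)) {
                # process $line
            }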

        Additionally, you might want to avoid using a threaded Perl (threaded builds are slower even if you don't use threads). On some platforms it can be faster to call read and then split the lines yourself than to let <> do it; on others the built-in is faster, and I believe that with a current Perl the performance problem behind that should be eliminated everywhere.
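        A rough sketch of the read-and-split approach (the buffer size and handle name are arbitrary, and any partial line at the end of a chunk has to be carried over to the next one):

            my ($chunk, $leftover) = ('', '');
            while (read(READ, $chunk, 65536)) {
                $chunk = $leftover . $chunk;
                my @lines = split /\n/, $chunk, -1;
                $leftover = pop @lines;    # possibly incomplete last line
                for my $line (@lines) {
                    # process $line here
                }
            }
            # process $leftover here if anything is left over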