Cyrnus has asked for the wisdom of the Perl Monks concerning the following question:

I have always heard that, of the following two methods of maintaining a data file, the second was faster:

1) Read the file line by line (writing each line to a temp file) looking for the block that needs to be updated, change it, write the remaining lines to the temp file, then overwrite the original file with the temp file.

2) Read the entire contents of the file into an array, step through the array looking for the block that needs to be updated, change it, then write back to the file.

After a recent discussion on a message board I frequent, I decided to test it for myself. I came up with the following code, which consistently proves the first method faster. I tested it using text files ranging in size from 30MB to 90MB. Is the first method really faster, or is there something in the way I'm implementing the second method that slows it down (perhaps using push to add the data to the second array)?
#!/usr/bin/perl -w
use strict;

my $stime = time();
my $filename = "file";
my $tempfile = "temp";
my $line;

open(OLD, "< $filename") or die "can't open $filename: $!";
open(NEW, "> $tempfile") or die "can't open $tempfile: $!";
while ($line = <OLD>) {
    # a code block to evaluate the current line
    # and possibly update it goes here
    print NEW $line or die "can't write $tempfile: $!";
}
close(OLD) or die "can't close $filename: $!";
close(NEW) or die "can't close $tempfile: $!";
rename($filename, "$filename.bak") or die "can't rename $filename: $!";
rename($tempfile, $filename) or die "can't rename $tempfile: $!";

my $ftime = time();
my $etime = $ftime - $stime;
print "$etime\n";

$stime = time();
open(DATA, "$filename") or die "can't open $filename: $!";
my @data = <DATA>;
my @vads;
close(DATA) or die "can't close $filename: $!";
foreach $line (@data) {
    # a code block to evaluate the current line
    # and possibly update it goes here
    push(@vads, $line);
}
open(DATA, ">$filename") or die "can't open $filename: $!";
foreach $line (@vads) {
    print DATA $line;
}
close(DATA) or die "can't close $filename: $!";

$ftime = time();
$etime = $ftime - $stime;
print "$etime\n";
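For comparison, here is a variant of the second method that avoids push entirely (an untested sketch, not part of the timings above): since foreach aliases $line to each element of @data, the update can happen in place and the whole array can be written back with a single print.

#!/usr/bin/perl -w
use strict;

# Untested variant of the second method: no second array, no push.
my $filename = "file";
my $stime = time();

open(DATA, "$filename") or die "can't open $filename: $!";
my @data = <DATA>;
close(DATA) or die "can't close $filename: $!";

foreach my $line (@data) {
    # a code block to evaluate the current line
    # and possibly update it goes here -- $line is an alias,
    # so assigning to it changes @data directly
}

open(DATA, ">$filename") or die "can't open $filename: $!";
print DATA @data;    # one print call for the whole array
close(DATA) or die "can't close $filename: $!";

print time() - $stime, "\n";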
John

Replies are listed 'Best First'.
Smaller is better
by RMGir (Prior) on Apr 28, 2002 at 12:01 UTC
    For large files, you're certainly going to benefit from not keeping vast amounts of data in memory: you'll save on memory allocation overhead, make better use of your processor's cache, and generally reduce machine load (from having less to manage in virtual memory), I would think.

    So I'd expect that once your file passes a certain mystery size threshold, your first approach will be much faster.

    If you don't need the other lines, why keep them around? Of course, for a small file, it's hard to beat the ease of typing

    my @lines=<FILE>;
    If your problem requires multiple passes over the data, or transformations across all the lines, then you may have no choice but to keep it all in memory. A matrix transposition would be one example.
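    A minimal sketch of that case, using a hypothetical matrix.txt with one whitespace-separated row per line: every output line needs a value from every input line, so nothing can be written until the whole file has been read.

    #!/usr/bin/perl -w
    use strict;

    # Hypothetical example: transpose a matrix stored one row per line.
    open(IN, "< matrix.txt") or die "can't open matrix.txt: $!";
    my @rows = map { [ split ' ' ] } <IN>;    # whole file in memory
    close(IN) or die "can't close matrix.txt: $!";

    # Column $col of the input becomes row $col of the output.
    for my $col (0 .. $#{ $rows[0] }) {
        print join(' ', map { $_->[$col] } @rows), "\n";
    }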
    --
    Mike
Re: Speed differences in updating a data file
by SleepyDave (Initiate) on Apr 28, 2002 at 04:59 UTC
    Well, while I don't know for certain about the speed, I've usually found the first method faster myself when dealing with large files. Also, a group I was working with had problems with extremely large arrays getting dropped and the files then being rewritten blank, so we switched to line-by-line reading with a temp file. It worked out a lot better.
Re: Speed differences in updating a data file
by CukiMnstr (Deacon) on Apr 28, 2002 at 08:48 UTC
    This does not answer your question (the previous answers should give enough information anyway), but I thought you might find a third method interesting (it's always useful to have many tricks in the bag): try Tie::File, by Dominus. It lets you manipulate a file as a regular Perl array (each file record -- one line is the default -- corresponding to an array element) without loading the whole file into memory, so it is particularly good when dealing with big files.
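    A rough, untested sketch of what the update loop looks like with it (the filename is just a placeholder):

    #!/usr/bin/perl -w
    use strict;
    use Tie::File;

    my $filename = "file";    # placeholder

    # @lines is a window onto the file itself: records are read and
    # written back on demand instead of being slurped all at once.
    tie my @lines, 'Tie::File', $filename
        or die "can't tie $filename: $!";

    for my $line (@lines) {
        # a code block to evaluate the current line
        # and possibly update it goes here -- assigning to $line
        # rewrites that record in the file
    }

    untie @lines;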

    hope this helps,

      The other day in Performing a tail(1) in Perl (reading the last N lines of a file), Chmrr showed that Tie::File is much, much slower than simpler read-by-line methods for reading files. This was for getting the end of the file, so I suspect that while the file is not loaded into memory, it is still read sequentially from beginning to end. It would be wise to benchmark the different approaches before settling on the technique to use.
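      For what it's worth, a rough sketch of how such a comparison might be set up with the standard Benchmark module (the update_* subs below are empty placeholders for the three approaches in this thread):

      use strict;
      use Benchmark qw(cmpthese);

      # Each sub should perform one full update pass (read, change, write).
      # The bodies are placeholders -- fill in the approach to be measured.
      sub update_via_temp  { }    # line by line through a temp file
      sub update_in_memory { }    # slurp into an array, write back
      sub update_tied      { }    # Tie::File

      # A negative count makes each run for at least that many CPU seconds.
      cmpthese(-10, {
          'temp file' => \&update_via_temp,
          'slurp'     => \&update_in_memory,
          'Tie::File' => \&update_tied,
      });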


      print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: Speed differences in updating a data file
by BUU (Prior) on Apr 28, 2002 at 07:10 UTC
    throwing around 30 meg arrays probably doesn't help...