Thanks for the info and the code snippet. I've got one question, though: I just thought of using the UNIX 'cat' command to splice together the two halves of the file (writing everything after the 'insertion point' to a different file, going back to add the new material to the original file, and then using 'cat' to combine the two files).
I'm going to be working with huge text files, though, so efficiency is a Very Big Deal. Do you know if using a system call to 'cat' will be less efficient than creating the temp file etc? I haven't found a comparable Perl function.
Thanks.
I'm going to be working with huge text files, though, so
efficiency is a Very Big Deal. Do you know if using a
system call to 'cat' will
be less efficient than creating the temp file etc?
You could try running some alternatives with the Benchmark
module, but I would expect that there are situations where a
system call to unix "cat" will be more efficient than doing
everything in Perl (and yours may be one such case).
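
For what it's worth, here is a minimal sketch of how such a comparison might be set up with Benchmark. The file names, file sizes, block size, and iteration count are all made up for illustration, and it assumes a Unix-like system with "cat" on the PATH. Benchmark reports CPU time, so for I/O-bound work it is probably worth checking wall-clock time on your real data as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempdir);

# Hypothetical test files; names and sizes are illustrative only.
my $dir = tempdir( CLEANUP => 1 );
my ( $head, $tail, $out ) = map { "$dir/$_" } qw(head.txt tail.txt joined.txt);

for my $file ( $head, $tail ) {
    open( my $fh, '>', $file ) or die "Cannot write $file: $!";
    print {$fh} "some line of text\n" x 100_000;
    close($fh) or die "Cannot close $file: $!";
}

# Candidate 1: let the shell and cat do the concatenation.
sub join_with_cat {
    system("cat '$head' '$tail' > '$out'") == 0
        or die "cat failed: $?";
}

# Candidate 2: copy both files in pure Perl, one large block at a time.
sub join_with_perl {
    open( my $dst, '>', $out ) or die "Cannot write $out: $!";
    binmode($dst);
    for my $file ( $head, $tail ) {
        open( my $src, '<', $file ) or die "Cannot read $file: $!";
        binmode($src);
        my $buf;
        while ( read( $src, $buf, 1 << 20 ) ) {
            print {$dst} $buf;
        }
        close($src);
    }
    close($dst) or die "Cannot close $out: $!";
}

# Run each candidate 20 times and print a comparison table.
cmpthese( 20, { cat => \&join_with_cat, perl => \&join_with_perl } );
```

The relative numbers will depend heavily on file size, disk caching, and the cost of spawning a shell, so treat any result from toy files like these as a starting point only.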
Maybe you want to structure the process so that it
can "batch" a bunch of "random-access" inserts into a single
"post-edit" loop that interleaves the sequential and "random
insertion" pieces of data. For example,
suppose that while you are writing a continuous, sequential
output stream, you also keep track of multiple
"breakpoints" where future additions may need to be spliced
in. Suppose further that, each time you come up with some
data that needs to be spliced in at one of those breakpoints,
you write this data
to some other temp file, or just keep it in a hash keyed by,
say, the byte offset where it's supposed to go. When you
get to the end of the
sequential output, you can read that back in portions
(from one breakpoint to the next) and interleave those with
the appropriate temp files or hash values in the required
sequence to produce the intended final output.
Actually, with this sort of design, I would think that Perl
could provide the easiest and most efficient method:
use sysread on the sequentially written temp file to get the
chunks between breakpoints, and write these out to the final
file, interleaved with the hash elements that store the
"post-hoc insertion" pieces. I hope that's clear, but here's
a pseudo-code summary:
open( SEQ, ">sequential_output.file" ) or die "can't write: $!";
$byteoffset = 0;
while there's data to be written {
    $breakpoint{$byteoffset} = "" if this location might
        need to get an insertion at a later point
    if I have sequential data {
        print SEQ $sequential_data;
        $byteoffset += length($sequential_data);
    }
    else { # this is data that needs to be back-fitted to
           # a byteoffset that I stored earlier
        $back_offset = whatever previous byteoffset is right
        $breakpoint{$back_offset} .= $backfit_data;
    }
}
close SEQ;
$breakpoint{$byteoffset} = "" unless exists $breakpoint{$byteoffset};
# do that so the following loop will handle the final chunk

# final loop: put all the pieces together
# in the intended order
open( SEQ, "<sequential_output.file" ) or die "can't read: $!";
open( OUT, ">final_output.file" ) or die "can't write: $!";
$previous = 0;
foreach $chunk ( sort {$a<=>$b} keys %breakpoint ) {
    # sysread takes a length, not an absolute offset, so read
    # only the bytes between the previous breakpoint and this one
    sysread( SEQ, $chunkdata, $chunk - $previous );
    print OUT $chunkdata, $breakpoint{$chunk};
    $previous = $chunk;
}
close OUT;
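
In case a runnable version is useful, here is one way the pseudo-code above might be fleshed out. Treat it as a sketch under some assumptions: the @events list, the 'seq'/'mark'/'fill' record types, and the %offset_of hash are invented purely for illustration and stand in for however your real data arrives. It also assumes plain byte data (no UTF-8 layers), so that length() counts bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir      = tempdir( CLEANUP => 1 );
my $seq_file = "$dir/sequential_output.file";
my $out_file = "$dir/final_output.file";

# Hypothetical input: 'seq' data is written in order, 'mark' names a
# possible insertion point, 'fill' supplies data for an earlier mark.
my @events = (
    [ seq  => "Header\n" ],
    [ mark => 'toc' ],
    [ seq  => "Chapter 1\n" ],
    [ fill => 'toc', "Contents: Chapter 1, Chapter 2\n" ],
    [ seq  => "Chapter 2\n" ],
);

my %breakpoint;    # byte offset => data to insert at that offset
my %offset_of;     # mark name   => byte offset
my $byte_offset = 0;

# Pass 1: write the sequential data, collecting back-fitted pieces.
open( my $seq, '>', $seq_file ) or die "Cannot write $seq_file: $!";
binmode($seq);
for my $event (@events) {
    my ( $type, @args ) = @$event;
    if ( $type eq 'seq' ) {
        print {$seq} $args[0];
        $byte_offset += length $args[0];
    }
    elsif ( $type eq 'mark' ) {
        $offset_of{ $args[0] } = $byte_offset;
        $breakpoint{$byte_offset} = '' unless exists $breakpoint{$byte_offset};
    }
    else {    # 'fill': append to whatever is queued for that mark
        $breakpoint{ $offset_of{ $args[0] } } .= $args[1];
    }
}
close($seq) or die "Cannot close $seq_file: $!";

# Make sure the loop below also copies the final chunk.
$breakpoint{$byte_offset} = '' unless exists $breakpoint{$byte_offset};

# Pass 2: copy the chunks between breakpoints, interleaving insertions.
open( $seq, '<', $seq_file ) or die "Cannot read $seq_file: $!";
binmode($seq);
open( my $out, '>', $out_file ) or die "Cannot write $out_file: $!";
binmode($out);

my $previous = 0;
for my $offset ( sort { $a <=> $b } keys %breakpoint ) {
    my $wanted = $offset - $previous;
    while ( $wanted > 0 ) {
        # Read at most 1 MB at a time so huge chunks never sit in memory.
        my $size = $wanted < 1 << 20 ? $wanted : 1 << 20;
        my $buf;
        my $got = sysread( $seq, $buf, $size );
        die "Read failed: $!" unless defined $got;
        last if $got == 0;    # unexpected end of file
        print {$out} $buf;
        $wanted -= $got;
    }
    print {$out} $breakpoint{$offset};
    $previous = $offset;
}
close($seq);
close($out) or die "Cannot close $out_file: $!";
```

Copying in bounded blocks rather than one sysread per chunk is a choice made here because the files are said to be huge; with small gaps between breakpoints a single read per chunk would do just as well.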