Re: Re: Inserting text into the middle of a file without clobbering any other text

Thanks for the info and the code snippet. I've got one question, though: I just thought of using the UNIX 'cat' command to splice together the two halves of the file (writing the text from the insertion point onward to a different file, coming back to insert the new text in the original file, and then using 'cat' to combine the two files).

I'm going to be working with huge text files, though, so efficiency is a Very Big Deal. Do you know if using a system call to 'cat' will be less efficient than creating the temp file etc? I haven't found a comparable perl function.

Thanks.

Replies are listed 'Best First'.

Re: Re: Re: Inserting text into the middle of a file without clobbering any other text
by graff (Chancellor) on Aug 02, 2002 at 04:16 UTC

I'm going to be working with huge text files, though, so efficiency is a Very Big Deal. Do you know if using a system call to 'cat' will be less efficient than creating the temp file etc?

You could try running some alternatives with the Benchmark module, but I would expect that there are situations where a system call to unix "cat" will be more efficient than doing everything in Perl (and yours may be one such case).
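For what it's worth, such a comparison could be set up roughly like this. It is only a sketch: the file names, file sizes, iteration count, and the two helper subs (join_with_cat and join_with_perl) are made up for illustration, and it assumes a Unix-like system with 'cat' on the PATH. Only cmpthese comes from the Benchmark module itself.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempdir);

# Hypothetical test data: two "halves" of a file that need to be joined.
my $dir = tempdir( CLEANUP => 1 );
my ( $head, $tail ) = ( "$dir/head.txt", "$dir/tail.txt" );
for my $file ( $head, $tail ) {
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "some line of text\n" x 100_000;
    close $fh or die "Cannot close $file: $!";
}

# Join the two halves by shelling out to cat.
sub join_with_cat {
    my ($out) = @_;
    system("cat '$head' '$tail' > '$out'") == 0
        or die "cat failed: $?";
}

# Join the two halves in Perl, copying in 64 KB blocks.
sub join_with_perl {
    my ($out) = @_;
    open my $out_fh, '>', $out or die "Cannot write $out: $!";
    for my $file ( $head, $tail ) {
        open my $in_fh, '<', $file or die "Cannot read $file: $!";
        my $buf;
        while ( sysread( $in_fh, $buf, 65536 ) ) {
            print {$out_fh} $buf;
        }
        close $in_fh;
    }
    close $out_fh or die "Cannot close $out: $!";
}

# Run each alternative 10 times and print a comparison table.
cmpthese( 10, {
    cat  => sub { join_with_cat("$dir/out_cat.txt") },
    perl => sub { join_with_perl("$dir/out_perl.txt") },
} );
```

The numbers will depend heavily on file size, disk caching, and the cost of starting a process on your system, so it is worth running this against files close to the size you actually expect.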

Maybe you want to structure the process so that it can "batch" a bunch of "random-access" inserts into a single "post-edit" loop that interleaves the sequential and "random insertion" pieces of data. For example, suppose that while you are writing a continuous, sequential output stream, you are also keeping track of multiple "breakpoints" where future additions may need to be spliced in. Suppose further that, each time you come up with some data that needs to be spliced in at one of those breakpoints, you write this data to some other temp file, or just keep it in a hash keyed by, say, the byte offset where it's supposed to go. When you get to the end of the sequential output, you can read that output back in portions (from one breakpoint to the next) and interleave those with the appropriate temp files or hash values in the required sequence, to produce the intended final output.

Actually, with this sort of design, I would think that Perl could well provide the easiest and most efficient method: use sysread on the sequentially written temp file to get the chunks between breakpoints, and write these out to the final file, interleaved with the hash elements that store the "post-hoc insertion" pieces. I hope that's clear, but here's a pseudo-code summary:

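open( my $seq, '>', "sequential_output.file" )
    or die "Cannot write sequential_output.file: $!";
my %breakpoint;
my $byteoffset = 0;
while there's data to be written {
   $breakpoint{$byteoffset} = "" if this location might
                   need to get an insertion at a later point
   if I have sequential data {
       print $seq $sequential_data;
       $byteoffset += length($sequential_data);
   }
   else {  # this is data that needs to be back-fitted to
           # a byteoffset that I stored earlier
       $back_offset = whatever previous byteoffset is right
       $breakpoint{$back_offset} .= $backfit_data;
   }
}
close $seq;
$breakpoint{$byteoffset} = "" unless exists $breakpoint{$byteoffset};
# do that so the following loop will handle the final chunk
# (without wiping out an insertion already stored at that offset)

# final loop: put all the pieces together
# in the intended order
open( $seq, '<', "sequential_output.file" )
    or die "Cannot read sequential_output.file: $!";
open( my $out, '>', "final_output.file" )
    or die "Cannot write final_output.file: $!";
my $prev = 0;
foreach my $chunk ( sort {$a<=>$b} keys %breakpoint ) {
   # sysread takes a length, not an end offset, so ask only for
   # the bytes between the previous breakpoint and this one
   sysread( $seq, my $chunkdata, $chunk - $prev );
   print $out $chunkdata, $breakpoint{$chunk};
   $prev = $chunk;
}
close $out;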
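
In case a runnable version helps, here is a minimal sketch of just the final merge step. The sub names (copy_bytes, merge_inserts), the 64 KB block size, and the sample data are my own assumptions for illustration, not anything from the original design. It also copies each chunk in bounded blocks, so a very large gap between two breakpoints never has to fit in memory at once.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy exactly $len bytes from $in to $out in blocks of at most 64 KB.
sub copy_bytes {
    my ( $in, $out, $len ) = @_;
    while ( $len > 0 ) {
        my $want = $len < 65536 ? $len : 65536;
        my $got  = sysread( $in, my $buf, $want );
        die "Read error: $!" unless defined $got;
        last if $got == 0;    # unexpected end of file
        print {$out} $buf;
        $len -= $got;
    }
}

# Merge the sequential file with the stored insertions. %$inserts maps a
# byte offset in the sequential file to the text to splice in at that point.
sub merge_inserts {
    my ( $seq_path, $out_path, $inserts ) = @_;
    open my $seq, '<', $seq_path or die "Cannot read $seq_path: $!";
    binmode $seq;
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    binmode $out;

    my $pos = 0;
    for my $offset ( sort { $a <=> $b } keys %$inserts ) {
        copy_bytes( $seq, $out, $offset - $pos );    # sequential chunk
        print {$out} $inserts->{$offset};            # back-fitted data
        $pos = $offset;
    }

    # Copy whatever follows the last breakpoint.
    my $buf;
    while ( sysread( $seq, $buf, 65536 ) ) {
        print {$out} $buf;
    }
    close $seq;
    close $out or die "Cannot close $out_path: $!";
}

# Small demonstration with made-up data.
my $dir      = tempdir( CLEANUP => 1 );
my $seq_file = "$dir/sequential_output.file";
my $final    = "$dir/final_output.file";

open my $fh, '>', $seq_file or die "Cannot write $seq_file: $!";
print {$fh} "Hello world!\n";
close $fh or die "Cannot close $seq_file: $!";

# Insert "," at byte offset 5 and " again" at byte offset 11.
merge_inserts( $seq_file, $final, { 5 => ",", 11 => " again" } );
# $final now contains "Hello, world again!\n"
```

Because the leftover tail is copied after the loop, this version does not need the empty end-of-file entry in the hash that the pseudo-code relies on.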