I'm going to be working with huge text files, though, so efficiency is a Very Big Deal. Do you know if using a system call to 'cat' will be less efficient than creating the temp file etc?

You could try timing some alternatives with the Benchmark module, but I would expect there are situations where a system call to unix "cat" will be more efficient than doing everything in Perl (and yours may be one such case).
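Here's a minimal sketch of what such a Benchmark run might look like, comparing a shelled-out "cat" append against a pure-Perl slurp-and-append. All the file names, the chunk size, and the iteration count are made up for illustration; adjust them to match your real data sizes, since the crossover point depends heavily on how big the chunks are.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# make a throw-away chunk to append repeatedly (size is arbitrary)
open( my $fh, '>', 'chunk.tmp' ) or die $!;
print $fh "some data\n" for 1 .. 1000;
close $fh;

timethese( 50, {
    system_cat => sub {
        # fork a shell and let unix cat do the append
        system('cat chunk.tmp >> out_cat.tmp') == 0 or die "cat failed";
    },
    pure_perl  => sub {
        # do the same append entirely in Perl
        open( my $in,  '<',  'chunk.tmp' )    or die $!;
        open( my $out, '>>', 'out_perl.tmp' ) or die $!;
        local $/;                  # slurp the whole chunk at once
        print $out scalar <$in>;
        close $in;
        close $out;
    },
} );
```

The per-call fork/exec overhead of system() tends to dominate for small chunks, while "cat" can win on very large files; only your own timings will say which applies.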

Maybe you want to structure the process so that it can batch a bunch of "random-access" inserts into a single post-edit loop that interleaves the sequential and randomly inserted pieces of data. For example, suppose that while you are writing a continuous, sequential output stream, you keep track of multiple "breakpoints" where future additions may need to be spliced in. Suppose further that each time you come up with data that needs to be spliced in at one of those breakpoints noted earlier in the output, you write that data to some other temp file, or just keep it in a hash keyed by, say, the byte offset where it's supposed to go. When you reach the end of the sequential output, you can read it back in portions (from one breakpoint to the next) and interleave those portions with the appropriate temp files or hash values, in the required sequence, to produce the intended final output.

Actually, with this sort of design, I would think that Perl could easily provide the easiest and most efficient method: use sysread on the sequentially written temp file to get the chunks between breakpoints, and write these out to the final file, interleaved with the hash elements that store the "post-hoc insertion" pieces. I hope that's clear, but here's a pseudo-code summary:

open( SEQ, ">", "sequential_output.file" ) or die $!;
$byte_offset = 0;
while ( there's data to be written ) {
    $breakpoint{$byte_offset} = ""
        if this location might need to get an insertion at a later point;
    if ( I have sequential data ) {
        print SEQ $sequential_data;
        $byte_offset += length( $sequential_data );
    }
    else {
        # this is data that needs to be back-fitted to
        # a byte offset that I stored earlier
        $back_offset = whatever previous byte offset is right;
        $breakpoint{$back_offset} .= $backfit_data;
    }
}
close SEQ;
$breakpoint{$byte_offset} = "";
# do that so the following loop will handle the final chunk

# final loop: put all the pieces together in the intended order
open( SEQ, "<", "sequential_output.file" ) or die $!;
open( OUT, ">", "final_output.file" )      or die $!;
$prev_offset = 0;
foreach $chunk ( sort { $a <=> $b } keys %breakpoint ) {
    # read just the bytes between the previous breakpoint and this
    # one -- the distance, not the absolute offset
    sysread( SEQ, $chunkdata, $chunk - $prev_offset );
    print OUT $chunkdata, $breakpoint{$chunk};
    $prev_offset = $chunk;
}
close SEQ;
close OUT;
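To make that concrete, here is a tiny runnable version of the same idea on toy data (the file names and the "AAAA"/"BBBB"/"CCCC" pieces are invented for illustration). Note the $prev bookkeeping: sysread wants the number of bytes from the previous breakpoint to the current one, not the breakpoint's absolute offset.

```perl
use strict;
use warnings;

# toy demonstration: stream "AAAA" then "CCCC" sequentially,
# then splice "BBBB" in at the breakpoint between them
my %breakpoint;
my $byte_offset = 0;

open( my $seq, '>', 'sequential_output.file' ) or die $!;
for my $piece ( 'AAAA', 'CCCC' ) {
    $breakpoint{$byte_offset} = '';   # a spot that may need an insertion later
    print $seq $piece;
    $byte_offset += length $piece;
}
# some data turns up later that belongs at offset 4 (right after "AAAA")
$breakpoint{4} .= 'BBBB';
close $seq;
$breakpoint{$byte_offset} = '';       # so the loop flushes the final chunk

# interleave: emit each between-breakpoint chunk, then its insertion
open( $seq, '<', 'sequential_output.file' ) or die $!;
open( my $out, '>', 'final_output.file' )   or die $!;
my $prev = 0;
for my $offset ( sort { $a <=> $b } keys %breakpoint ) {
    my $chunk = '';
    # read only the bytes between the previous breakpoint and this one
    sysread( $seq, $chunk, $offset - $prev ) if $offset > $prev;
    print $out $chunk, $breakpoint{$offset};
    $prev = $offset;
}
close $seq;
close $out;
# final_output.file now contains "AAAABBBBCCCC"
```

The whole "random access" problem thus reduces to one sequential write pass plus one sequential read pass, which is about as friendly to huge files as it gets.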

In reply to Re: Re: Re: Inserting text into the middle of a file without clobbering any other text by graff
in thread Inserting text into the middle of a file without clobbering any other text by Anonymous Monk
