in reply to breaking up very large text files - windows

I just came up with this solution on the fly, and I've never had to do this in Perl before. It could probably be optimized a bit, and it might not handle some cosmically weird special case, like a block boundary falling right on a newline, or seeking past the beginning of the file (shouldn't be a problem with a huge file, but would be good to handle all the same). It does work on the decent base cases I tried out. I picked an arbitrary block size of 1024 bytes, but you can override it with the third argument...

    sub print_trailing_lines {
        my $file       = shift;          # file to read
        my $num_lines  = shift;          # number of lines to grab
        my $block_size = shift || 1024;  # size of blocks to slurp
        my @lines;                       # lines collected so far, in file order

        open(my $fh, '<', $file) or die "could not open $file for reading: $!";
        binmode($fh);                    # plain byte offsets, no CRLF translation
        my $pos = -s $fh;                # start at the end of the file

        # read blocks backwards until we hold two more entries than requested
        # (the oldest may be a partial line, the newest may be the empty string
        # after a final newline) or until we reach the beginning of the file
        while ($pos > 0 && @lines <= $num_lines + 1) {
            my $size = $pos < $block_size ? $pos : $block_size;
            $pos -= $size;               # never seek past the beginning
            seek($fh, $pos, 0) or die "could not seek in $file: $!";
            read($fh, my $block, $size); # suck in a block
            # the oldest line we hold may be incomplete: glue it onto this block
            $block .= shift(@lines) if @lines;
            # split block into lines (-1 keeps trailing empty fields)
            unshift(@lines, split(/\n/, $block, -1));
        }
        close($fh);

        pop(@lines) if @lines && $lines[-1] eq '';   # file ended with a newline
        # deal with the fact that we may hold more lines than we want
        splice(@lines, 0, @lines - $num_lines) if @lines > $num_lines;
        s/\r\z// for @lines;             # drop carriage returns from CRLF files
        print join("\n", @lines), "\n" if @lines;
    }

Probably not a perfect solution, but it should give you the basis for how to solve the problem. I think both this and the above mentioned "round robin" solution have their advantages. With this solution, you need to know roughly the size of lines if you want a sane block size, but you can override that as necessary. The real advantage is that you avoid the horribly wasteful expense of reading through the whole file. The "round robin" solution, I think, will only cut your wait time in half. My solution's execution time, however, depends on how many lines you ask for and should not vary with file size.
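For comparison, here is a rough sketch of what I take the "round robin" approach to be: read every line and keep only the most recent N in a circular buffer. This is my own guess at it, not the earlier poster's code, and the name last_lines_round_robin is made up for illustration. It shows why that approach has to touch the whole file no matter how few lines you want.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the earlier poster's code):
# return the last $n lines of $file by reading every line and keeping
# only the most recent $n in a circular buffer.
sub last_lines_round_robin {
    my ($file, $n) = @_;
    return () if $n <= 0;

    my @ring;       # holds at most $n lines
    my $count = 0;  # total number of lines seen so far

    open(my $fh, '<', $file) or die "could not open $file for reading: $!";
    while (my $line = <$fh>) {
        $ring[$count % $n] = $line;   # overwrite the oldest slot
        $count++;
    }
    close($fh);

    # fewer lines than requested (or exactly $n): buffer is already in order
    return @ring if $count <= $n;

    # otherwise the oldest kept line sits in the slot we would overwrite next
    my $start = $count % $n;
    return (@ring[$start .. $n - 1], @ring[0 .. $start - 1]);
}
```

Calling print last_lines_round_robin('big.log', 10) would print the last ten lines (the file name is just a placeholder), but only after the loop has walked the file from the first line to the last.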