antichef has asked for the wisdom of the Perl Monks concerning the following question:

I'm working with an application that has recently started generating very large log files (probably due to some yucky unchecked recursion in one or more places in the code - it's not Perl :) ). The app runs in a Windows environment, and the files get so big (>3GB) that the various Windows ports of 'tail' that we use refuse to open them -- e.g.:
tail: bigfile.log: Invalid argument Assertion failed: valid_file_spec (f), file tail.c, line 700

So I stepped in and slogged out some Perl to get the last 1000 lines of the file:

open (FILE, "<bigfile.log");
while ($line = <FILE>) { $count++; }
$numlines = $count;
$count = 0;
close FILE;

open (FILE, "<bigfile.log");
open (SMALLERFILE, ">smallerfile.log");
while ($line = <FILE>) {
    $count++;
    if ($count > $numlines - 1000) {
        print SMALLERFILE $line;
    }
}
close FILE;
close SMALLERFILE;
It works (though it takes forever), but there's got to be a better way... :) I looked through the docs but haven't had much luck. Any suggestions?

Replies are listed 'Best First'.
Re: breaking up very large text files - windows
by dga (Hermit) on Jul 28, 2003 at 20:32 UTC

    One possibility is to seek to the end of the file and work back from there.

    If you don't need exactly 1000 lines, another possibility is to seek to some distance from the end, say 1k bytes, read and discard one line (probably a partial line), then output the rest of the file to your small file. If the log messages are a fixed length, the math to get exactly 1000 lines from the end is easy. But since you just want to tail the file, seeking back some number of kilobytes and then reading forward to the next line break to find the start of a line should work, and quickly to boot.

    open(BIG, "<name_here") or die $!;
    open(SMALL, ">other_name_here") or die $!;  # open SMALL for writing
    seek(BIG, -1024, 2);    # whence 2 = SEEK_END: start 1k from the end
    my $junk = <BIG>;       # pitch to end of current (probably partial) line
    while (<BIG>) {
        print SMALL $_;
    }
    close BIG;
    close SMALL;
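
    Building on that, here's an untested sketch that keeps doubling the seek distance until at least 1000 complete lines turn up ('bigfile.log', the 1000-line count, and the 64k initial guess are all placeholders):

    use strict;
    use warnings;

    my ($file, $want) = ('bigfile.log', 1000);
    my $guess = 64 * 1024;              # initial guess: 64k from the end

    open(my $big, '<', $file) or die "can't open $file: $!";
    my @lines;
    while (1) {
        seek($big, -$guess, 2) or seek($big, 0, 0);  # clamp at start of file
        my $at_start = (tell($big) == 0);
        <$big> unless $at_start;        # discard a probably-partial line
        @lines = <$big>;
        last if @lines >= $want or $at_start;
        $guess *= 2;                    # not enough lines yet; back up further
    }
    splice(@lines, 0, @lines - $want) if @lines > $want;
    print @lines;
    close $big;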
Re: breaking up very large text files - windows
by fglock (Vicar) on Jul 28, 2003 at 20:44 UTC
Re: breaking up very large text files - windows
by CountZero (Bishop) on Jul 28, 2003 at 20:55 UTC

    Another way of doing it is going through your big file and filling an array of (say) 1000 lines (or as many as you want to keep), replacing the oldest lines with newer ones in round-robin fashion. At the end of the big file, the array holds your last 1000 lines.

    If you don't know beforehand how long the lines are, I think this is the best and fastest way of doing it (you don't have to read your big file twice, as you did), with the possible exception of the modules already suggested.

    Update: As a matter of fact, this is how the Perl Power Tools tail works too (though it has a lot more options, which you may not need). The relevant code from the Power Tools follows (slightly adapted):

    while (<$fh>) {
        $i++;
        $buf[ $i % $p ] = $_;           # overwrite the oldest slot
    }
    my @tail = ( @buf[ ($i % $p) + 1 .. $#buf ],   # oldest surviving lines first
                 @buf[ 0 .. $i % $p ] );
    for (@tail) { print if $_; }
    $fh is the filehandle to your big file and $p is the number of lines you need
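
    For example, wired up as a complete script (an untested sketch; 'bigfile.log' and the 1000-line count are placeholders):

    use strict;
    use warnings;

    my $p = 1000;                       # number of lines to keep
    my $i = 0;
    my @buf;

    open(my $fh, '<', 'bigfile.log') or die "can't open bigfile.log: $!";
    while (<$fh>) {
        $i++;
        $buf[ $i % $p ] = $_;           # overwrite the oldest slot
    }
    close $fh;

    my @tail = ( @buf[ ($i % $p) + 1 .. $#buf ], @buf[ 0 .. $i % $p ] );
    print grep { defined } @tail;       # 'defined' copes with files shorter than $p lines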

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: breaking up very large text files - windows
by skyknight (Hermit) on Jul 28, 2003 at 21:36 UTC

    I just came up with this solution on the fly; I'd never had to do this in Perl before now. It could probably be optimized a bit, and might not handle some cosmically weird special case, like a boundary falling right on a newline, or seeking past the beginning of the file (that shouldn't be a problem with a huge file, but would be good to handle all the same), but it works on some decent base cases I tried out. I picked an arbitrary block size, but you can override it...

    sub print_trailing_lines {
        my $file       = shift();         # file to read
        my $num_lines  = shift();         # number of lines to grab
        my $block_size = shift() || 1024; # size of blocks to slurp
        my @lines = ();                   # accumulated lines, oldest first

        open(FILE, "<$file") or die "could not open $file for reading: $!";
        seek(FILE, -$block_size, 2);      # go to the end, minus a block

        while ($num_lines) {              # while we've got more lines to grab...
            my $block = undef;
            read(FILE, $block, $block_size, 0);  # suck in a block
            seek(FILE, -2 * $block_size, 1);     # back up in file by two blocks
            my @chunks = split /\n/, $block;     # split block into lines

            # glue the last (partial) line of this block onto the first
            # line of the later block we read previously
            unshift(@lines, pop(@chunks) . shift(@lines)) if @lines;

            # deal with fact that current block
            # might have more lines than we want
            shift(@chunks) while (@chunks > $num_lines);

            # subsume this block's lines
            unshift(@lines, @chunks);

            # make note of how many lines we grabbed
            $num_lines -= scalar(@chunks);
        }
        close(FILE);
        print join("\n", @lines), "\n";
    }
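
    Called like so, for example ('bigfile.log' is a placeholder):

    print_trailing_lines('bigfile.log', 1000);          # default 1k blocks
    print_trailing_lines('bigfile.log', 1000, 65536);   # bigger blocks if lines are long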

    Probably not a perfect solution, but it should give you the basis for solving the problem. I think both this and the above-mentioned "round robin" solution have their advantages. With this solution you need to know roughly how long the lines are if you want a sane block size, but you can override that as necessary. The real advantage is that you avoid the horribly wasteful expenditure of reading through the whole file. The "round robin" solution, I think, will only cut your wait time in half, since it still reads the whole file once instead of twice. My solution's execution time, however, should not vary with file size.

Re: breaking up very large text files - windows
by Cody Pendant (Prior) on Jul 28, 2003 at 23:41 UTC
    Just one other suggestion that nobody's come up with -- is it possible to use some kind of Tie module and treat the lines of the file as an array?
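
    Something like Tie::File would be the obvious candidate (an untested sketch; 'bigfile.log' is a placeholder, and note that the module still has to scan the file for line boundaries, so the caveat below applies):

    use strict;
    use warnings;
    use Fcntl 'O_RDONLY';
    use Tie::File;

    # Treat the log as a read-only array of lines.
    tie my @log, 'Tie::File', 'bigfile.log', mode => O_RDONLY
        or die "can't tie bigfile.log: $!";
    print "$_\n" for @log[ -1000 .. -1 ];   # last 1000 lines (Tie::File strips the newlines)
    untie @log;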

    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print

      Good idea, but if the big file is really BIG, one perhaps runs into memory problems (I guess it will depend on how the tie is done), and one still somehow has to run through the file to see where the individual lines start before you can tie your array to them.

      In the same vein, I was thinking of some DBI/DBD solution (there are DBD drivers for flat-file database files), but it also needs to work through the file to find the individual records, unless you have fixed-length records, in which case the matter is trivial to solve in any case.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: breaking up very large text files - windows
by derby (Abbot) on Jul 28, 2003 at 20:31 UTC
    The first step in having a usable Windows box is to install Cygwin.
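
    Then, assuming your Cygwin tail build has large-file support, it's just:

    tail -n 1000 bigfile.log > smallerfile.log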

    -derby