in reply to Re: split file in N part
in thread split file in N part
If it were me, I'd probably use something like Tie::File to get the number of lines:
Counting lines in large files (which presumably these are, hence the need to split them) is a really terrible way to use Tie::File. To quote the author:
There is a large memory overhead for each record offset and for each cache entry: about 310 bytes per cached data record, and about 21 bytes per offset table entry.
The per-record overhead will limit the maximum number of records you can access per file. Note that accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets. The same for foreach (@tied_file), even if you exit the loop early.
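To make the overhead concrete, here is a minimal sketch of the approach being criticized (my own illustration, not code from the parent post): taking `scalar @tied` forces Tie::File to read the entire file and cache an offset for every record, paying the per-record cost quoted above just to get a number.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical sketch of counting lines via Tie::File.
# scalar @lines walks the whole file and stores an offset
# per record -- the overhead described in the docs above.
sub countLinesTied {
    my $filename = shift;
    tie my @lines, 'Tie::File', $filename
        or die "failed to tie '$filename' - $!";
    my $count = scalar @lines;   # reads and indexes every record
    untie @lines;
    return $count;
}
```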
A simple:
use Carp;   ## needed for croak()

sub countLines {
    my $filename = shift;
    open my $fh, '<', $filename
        or croak( "failed to open '$filename' - $!" );
    my $count = 0;
    $count++ while <$fh>;
    return $count;
}
is far, far (and for very large files, far) more efficient than abusing Tie::File for this. And it is hardly more complex. For very large files, using a larger buffer will save a little more time:
use Carp;   ## needed for croak()

sub countLines {
    my $filename = shift;
    open my $fh, '<', $filename
        or croak( "failed to open '$filename' - $!" );
    my $count = 0;
    local $/ = \2**26;   ## read in 64MB chunks; raise or lower to taste
    $count += tr[\n][\n] while <$fh>;
    return $count;
}
And File::Split will blow memory if the input file (or combined output file in the case of merge_files()) is larger than memory.
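For contrast, splitting can be done in constant memory by streaming: count the lines first (with `countLines` above), then copy ceil(total/N) lines at a time into each part. This is only a sketch of the idea, not File::Split's interface; the sub name and the `"$filename.$part"` naming scheme are my own invention.

```perl
use strict;
use warnings;
use POSIX qw( ceil );

# Stream-split $filename into $n parts of roughly equal line counts.
# $total is the line count (e.g. from countLines). Only one line is
# held in memory at a time, so file size is irrelevant.
sub splitFile {
    my( $filename, $n, $total ) = @_;
    my $perPart = ceil( $total / $n );
    open my $in, '<', $filename
        or die "failed to open '$filename' - $!";
    for my $part ( 1 .. $n ) {
        open my $out, '>', "$filename.$part"
            or die "failed to open part $part - $!";
        for ( 1 .. $perPart ) {
            defined( my $line = <$in> ) or last;
            print $out $line;
        }
        close $out;
    }
    close $in;
    return;
}
```

Splitting a 10-line file into 3 parts this way yields parts of 4, 4, and 2 lines.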
Blind CPANitis serves no one. Teaching it ...
Re^3: split file in N part
by jdporter (Paladin) on Mar 10, 2008 at 18:06 UTC
by BrowserUk (Patriarch) on Mar 10, 2008 at 18:24 UTC