Re^2: split file in N part

If it were me, I'd probably use something like Tie::File to get the number of lines:

Counting lines in large files (which presumably these are, hence the need to split them) is a really terrible way to use Tie::File. To quote the author:

There is a large memory overhead for each record offset and for each cache entry: about 310 bytes per cached data record, and about 21 bytes per offset table entry.
The per-record overhead will limit the maximum number of records you can access per file. Note that accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets. The same for foreach (@tied_file), even if you exit the loop early.
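For a sense of scale, here is a back-of-envelope sketch. The 21 bytes per offset entry is the figure quoted above; the 100 million line count is purely an assumed example, so substitute your own.

```perl
use strict;
use warnings;

# Figure quoted from the Tie::File documentation above.
my $bytes_per_offset = 21;

# Hypothetical file size in lines (an assumption for illustration).
my $lines = 100_000_000;

# scalar(@tied_file) stores an offset for every record, so the
# offset table alone costs roughly this much memory.
my $overhead_bytes = $bytes_per_offset * $lines;
printf "%.2f GB just for the offset table\n", $overhead_bytes / 2**30;
```

That is before any cached records are counted, and all of it is spent just to learn a single number.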

A simple:

use Carp qw(croak);

sub countLines {
    my $filename = shift;
    open my $fh, '<', $filename
        or croak("failed to open '$filename' - $!");
    my $count = 0;
    $count++ while <$fh>;   ## one line at a time; memory use stays flat
    return $count;
}

Is far, far (and, for very large files, far) more efficient than abusing Tie::File for this. And it is hardly more complex. For very large files, reading in larger fixed-size chunks will save a little more time:

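sub countLines {
    my $filename = shift;
    open my $fh, '<', $filename
        or croak("failed to open '$filename' - $!");
    my $count = 0;
    local $/ = \( 2**26 ); ## read fixed 64MB chunks; raise or lower to taste
    ## NB: a final line with no trailing newline is not counted
    $count += tr[\n][\n] while <$fh>; ## count the newlines in each chunk
    return $count;
}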

And File::Split will blow memory if the input file (or combined output file in the case of merge_files()) is larger than memory.
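By contrast, a split that streams the input never needs more than one line in memory. What follows is only a minimal sketch under stated assumptions: `splitFile` and the `"$filename.N"` output naming are hypothetical, not anything from File::Split or the original poster's code, and it divides by line count rather than by bytes.

```perl
use strict;
use warnings;
use Carp qw(croak);
use POSIX qw(ceil);

# Count lines by streaming; never holds more than one line in memory.
sub countLines {
    my $filename = shift;
    open my $fh, '<', $filename
        or croak("failed to open '$filename' - $!");
    my $count = 0;
    $count++ while <$fh>;
    close $fh;
    return $count;
}

# Hypothetical helper: split $filename into at most $parts files named
# "$filename.1", "$filename.2", ... Returns the list of names written.
sub splitFile {
    my( $filename, $parts ) = @_;
    croak("parts must be a positive integer")
        unless $parts && $parts =~ /^\d+$/;

    my $total   = countLines( $filename );
    my $perPart = ceil( $total / $parts ) || 1;   ## lines per output file

    open my $in, '<', $filename
        or croak("failed to open '$filename' - $!");

    my( @names, $out );
    my $written = 0;
    while( my $line = <$in> ) {
        ## Start a new output file every $perPart lines.
        if( $written % $perPart == 0 ) {
            close $out if $out;
            push @names, $filename . '.' . ( @names + 1 );
            open $out, '>', $names[-1]
                or croak("failed to open '$names[-1]' - $!");
        }
        print {$out} $line;
        $written++;
    }
    close $out if $out;
    close $in;
    return @names;
}
```

Calling `splitFile( 'big.log', 3 )` on a 10-line file would write parts of 4, 4 and 2 lines. The file is read twice (once to count, once to copy), which costs some I/O but keeps memory use independent of file size.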

Blind CPANitis serves no one. Teaching it ...

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."


Re^3: split file in N part by jdporter (Paladin) on Mar 10, 2008 at 18:06 UTC
"accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets."

Yowch! Missed that bit.

"Blind CPANitis serves no one."

It is unfortunate that some modules have limitations. (In fact, I personally don't bother with Tie::File, ever since I noticed that using it to modify file contents can cause corruption.) But I don't believe that we should ignore CPAN or discourage the use of modules. IMHO, one shouldn't worry about the limitations of a module unless and until one has reason to believe that one's usage will be affected. And you'll notice that I referred to File::Split as an afterthought, not as a first and only suggestion.

A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re^4: split file in N part by BrowserUk (Patriarch) on Mar 10, 2008 at 18:24 UTC
"It is unfortunate that some modules have limitations."

A file splitting module that cannot split files larger than it can hold in memory is more than a little bit limited.

"But I don't believe that we should ignore CPAN or discourage the use of modules."

It's not a matter of "discouraging the use of CPAN". I've never done that, and never would. It's a matter of not blindly suggesting the use of modules either a) for inappropriate uses, or b) because the "name sounds right", without having looked inside to see what they actually do. Especially as a replacement for existing, working code that doesn't have the limitations of the module you are suggesting.