in reply to Re: split file in N part
in thread split file in N part

If it were me, I'd probably use something like Tie::File to get the number of lines:

Counting lines in large files (which presumably these are, hence the need to split them) is a really terrible way to use Tie::File. To quote the author:

There is a large memory overhead for each record offset and for each cache entry: about 310 bytes per cached data record, and about 21 bytes per offset table entry.

The per-record overhead will limit the maximum number of records you can access per file. Note that accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets. The same for foreach (@tied_file), even if you exit the loop early.
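To put a rough number on that, here is a back-of-envelope sketch using only the 21-bytes-per-offset figure quoted above. The sub name offset_table_bytes and the sample line counts are made up for illustration; the real overhead will vary with perl build and platform. At that rate, merely asking for the length of a tied 100-million-line file costs on the order of 2.1 GB for the offset table alone.

```perl
use strict;
use warnings;

## Figure quoted from the Tie::File documentation; treat it as approximate.
my $BYTES_PER_OFFSET = 21;

## Hypothetical helper: estimated offset-table size for a given record count.
sub offset_table_bytes {
    my $records = shift;
    return $records * $BYTES_PER_OFFSET;
}

## Illustrative line counts only; none of these come from the thread.
for my $lines ( 1_000_000, 100_000_000, 1_000_000_000 ) {
    printf "%d lines: about %.0f MB of offset table\n",
        $lines, offset_table_bytes( $lines ) / 1e6;
}
```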

A simple:

    use Carp;

    sub countLines {
        my $filename = shift;
        open my $fh, '<', $filename
            or croak( "failed to open '$filename' - $!" );
        my $count = 0;
        $count++ while <$fh>;    ## one line at a time; nothing is retained
        return $count;
    }

is far, far (and for very large files, far) more efficient than abusing Tie::File for this, and it is hardly more complex. For very large files, reading in larger fixed-size chunks will save a little more time:

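    use Carp;

    sub countLines {
        my $filename = shift;
        open my $fh, '<', $filename
            or croak( "failed to open '$filename' - $!" );
        my $count = 0;
        local $/ = \( 2**26 );    ## read 64MB records; raise or lower to taste
        ## Count the newlines in each chunk. A final line with no
        ## trailing newline is not counted.
        $count += tr[\n][\n] while <$fh>;
        return $count;
    }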

And File::Split will blow memory if the input file (or combined output file in the case of merge_files()) is larger than memory.
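For comparison, a minimal sketch of a split that never holds more than one line in memory. This is not what File::Split does, and it is not code from earlier in the thread: the sub name split_file_n, the ".partN" output naming, and the choice to split on line boundaries in two passes are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Carp;

## Hypothetical helper: split $filename into at most $n parts with roughly
## equal line counts. Returns the list of part names it created.
sub split_file_n {
    my( $filename, $n ) = @_;
    croak "part count must be a positive integer"
        unless $n && $n =~ /^\d+$/;

    open my $in, '<', $filename
        or croak "failed to open '$filename' - $!";

    ## Pass 1: count the lines without storing them.
    my $lines = 0;
    while( defined( my $line = <$in> ) ) { ++$lines }

    ## Lines per part, rounded up so nothing is left over.
    my $per = int( ( $lines + $n - 1 ) / $n ) || 1;

    ## Pass 2: rewind and copy lines out, starting a new part every $per lines.
    seek $in, 0, 0 or croak "failed to rewind '$filename' - $!";

    my( $out, $written, $part, @parts ) = ( undef, 0, 0 );
    while( defined( my $line = <$in> ) ) {
        if( !$out or $written == $per ) {
            close $out or croak "failed to close part $part - $!" if $out;
            my $name = sprintf '%s.part%d', $filename, ++$part;
            open $out, '>', $name
                or croak "failed to open '$name' - $!";
            push @parts, $name;
            $written = 0;
        }
        print {$out} $line;
        ++$written;
    }
    close $out or croak "failed to close part $part - $!" if $out;
    close $in;
    return @parts;
}
```

Memory use is bounded by the longest line rather than by the size of the file. A file with fewer lines than $n simply produces fewer parts, and the last part may be shorter than the others.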

Blind CPANitis serves no one. Teaching it ...


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^3: split file in N part
by jdporter (Paladin) on Mar 10, 2008 at 18:06 UTC
    "accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets."

    Yowch! Missed that bit.

    Blind CPANitis serves no one.

    It is unfortunate that some modules have limitations. (In fact, I personally don't bother with Tie::File, ever since I noticed that using it to modify file contents can cause corruption.) But I don't believe that we should ignore CPAN or discourage the use of modules. IMHO, one shouldn't worry about the limitations of a module unless and until one has reason to believe that one's usage will be affected. And you'll notice that I referred to File::Split as an afterthought, not as a first and only suggestion.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
      It is unfortunate that some modules have limitations.

      A file splitting module that cannot split files larger than it can hold in memory is more than a little bit limited.

      But I don't believe that we should ignore CPAN or discourage the use of modules.

      It's not a matter of "discouraging the use of CPAN". I've never done that, and never would.

      It's a matter of not blindly suggesting modules either a) for inappropriate uses, or b) because the "name sounds right", without having looked inside to see what they actually do.

      Especially as a replacement for existing, working code that doesn't have the limitations of the module you are suggesting.

