in reply to split file in N part

Not bad. But I think you have a bug. Probably just a typo. I believe

    my $nbLinesPerFile = $totalCount;
should be
    my $nbLinesPerFile = int( $totalCount / $nbFiles );
Right?
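To see why the int() matters, here's a quick check of the corrected formula with hypothetical numbers (1003 lines into 4 parts). Note the division leaves a remainder that the split code has to put somewhere, typically in the last part:

```perl
use strict;
use warnings;

# Hypothetical counts, purely to illustrate the corrected formula.
my $totalCount = 1003;
my $nbFiles    = 4;

my $nbLinesPerFile = int( $totalCount / $nbFiles );   # 250
my $leftover       = $totalCount % $nbFiles;          # 3 lines for the last part

print "$nbLinesPerFile lines per file, $leftover left over\n";
```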

Aside from that, there are a number of places where I think your code could be improved. In particular, I'd try to make more use of existing modules.

One useful thing is to use Carp when issuing error messages, since it tends to give better context.
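For instance, a sketch of the difference: croak() reports the error from the *caller's* perspective, so the message points at the line that passed the bad filename rather than at a line buried inside the utility sub:

```perl
use strict;
use warnings;
use Carp;

# croak() dies with the caller's file and line number in the message,
# which is usually the context the user actually needs.
sub countLines {
    my $filename = shift;
    open my $fh, '<', $filename
        or croak("failed to open '$filename' - $!");
    my $count = 0;
    $count++ while <$fh>;
    return $count;
}
```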

In sub countLines, you should localize $_. Even better would be to use a lexical variable for the purpose. If it were me, I'd probably use something like Tie::File to get the number of lines:

    use Tie::File;
    use Carp;

    sub countLines {
        my $filename = shift;
        tie my @array, 'Tie::File', $filename
            or croak("failed to tie '$filename' - $!");
        scalar @array;
    }
Such a tied array could also be used in the other sub, for iterating over the lines of input.
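For example, a sketch of that idea (sub name is made up; fine for modest files, but see the caveats about Tie::File's per-record overhead in the reply below): the tied array lets the splitting code address lines by index instead of reading them manually.

```perl
use strict;
use warnings;
use Tie::File;
use Carp;

# Return the first $n lines of a file via a tied array.
# Tie::File strips the record separator from each element.
sub firstLines {
    my ( $filename, $n ) = @_;
    tie my @lines, 'Tie::File', $filename
        or croak("failed to tie '$filename' - $!");
    my @head = @lines[ 0 .. $n - 1 ];
    untie @lines;
    return @head;
}
```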

You also need to do error checking on all your open calls — and be sure to include the "reason" ($!) in the error message!
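One way to avoid repeating the check everywhere is a small helper; this is a sketch (the sub name is made up), but it shows the pattern of always including $! so the user sees the OS reason (missing file, permissions, ...):

```perl
use strict;
use warnings;
use Carp;

# Open a file or die with the mode, the name, and the OS error.
sub open_or_croak {
    my ( $mode, $name ) = @_;
    open my $fh, $mode, $name
        or croak("cannot open '$name' ($mode) - $!");
    return $fh;
}
```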

Ultimately, you could just use File::Split:

    use File::Split;
    File::Split->new({ keepSource => '1' })
               ->split_file( { parts => $nbFiles }, $newfile );
:-)

A word spoken in Mind will reach its own level, in the objective world, by its own weight

Replies are listed 'Best First'.
Re^2: split file in N part
by BrowserUk (Patriarch) on Mar 10, 2008 at 15:27 UTC
    If it were me, I'd probably use something like Tie::File to get the number of lines:

    Counting lines in large files (which presumably these are, hence the need to split them) is a really terrible way to use Tie::File. To quote the author:

    There is a large memory overhead for each record offset and for each cache entry: about 310 bytes per cached data record, and about 21 bytes per offset table entry.

    The per-record overhead will limit the maximum number of records you can access per file. Note that accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets. The same for foreach (@tied_file), even if you exit the loop early.

    A simple:

    sub countLines {
        my $filename = shift;
        open my $fh, '<', $filename
            or croak("failed to open '$filename' - $!");
        my $count = 0;
        $count++ while <$fh>;
        return $count;
    }

    is far, far (and, for very large files, far) more efficient than abusing Tie::File for this. And it is hardly more complex. For very large files, using a larger buffer will save a little more time:

    sub countLines {
        my $filename = shift;
        open my $fh, '<', $filename
            or croak("failed to open '$filename' - $!");
        my $count = 0;
        local $/ = \2**26;    ## 64MB - raise or lower to taste
        $count += tr[\n][\n] while <$fh>;
        return $count;
    }

    And File::Split will blow memory if the input file (or combined output file in the case of merge_files()) is larger than memory.

    Blind CPANitis serves no one. Teaching it ...


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      "accessing the length of the array via $x = scalar @tied_file accesses all records and stores their offsets."

      Yowch! Missed that bit.

      Blind CPANitis serves no one.

      It is unfortunate that some modules have limitations. (In fact, I personally don't bother with Tie::File, ever since I noticed that using it to modify file contents can cause corruption.) But I don't believe that we should ignore CPAN or discourage the use of modules. IMHO, one shouldn't worry about the limitations of a module unless and until one has reason to believe that one's usage will be affected. And you'll notice that I referred to File::Split as an afterthought, not as a first and only suggestion.

      A word spoken in Mind will reach its own level, in the objective world, by its own weight
        It is unfortunate that some modules have limitations.

        A file-splitting module that cannot split files larger than it can hold in memory is more than a little bit limited.

        But I don't believe that we should ignore CPAN or discourage the use of modules.

        It's not a matter of "discouraging the use of CPAN". I've never done that, and never would.

        It's a matter of not blindly suggesting modules either a) for inappropriate uses, or b) because the "name sounds right", without having looked inside at what they actually do.

        Especially as a replacement for existing, working code that doesn't have the limitations of the module you are suggesting.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: split file in N part
by jeepj (Scribe) on Mar 10, 2008 at 14:29 UTC

    You are right, jdporter, this is a typo (in fact, a modification made for some obscure testing that I didn't revert correctly). I am correcting the post. I will also take a look at your proposals for improvement.

    Regarding the usage of File::Split, I have some concerns for two reasons:

    • I want to keep control over the names of the resulting files
    • I want to use as few CPAN packages as possible, since a lot of my scripts run on my teammates' PCs, and they have only the "basic" Perl installation. The only additional package required at the moment is Tk, for some GUIs.
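    For what it's worth, a split that keeps full control of the output names needs nothing beyond core Perl. A rough sketch (the sub and the sprintf name pattern are made up; the last part absorbs any remainder, and parts beyond the line count are simply not created):

```perl
use strict;
use warnings;
use Carp;

# Split $filename into $nbFiles parts; output names come from
# sprintf($pattern, $part_number), e.g. 'part_%02d.txt'.
sub split_file {
    my ( $filename, $nbFiles, $pattern ) = @_;

    open my $in, '<', $filename
        or croak("cannot open '$filename' - $!");
    my $total = 0;
    $total++ while <$in>;
    seek $in, 0, 0 or croak("cannot rewind '$filename' - $!");

    my $per     = int( $total / $nbFiles ) || 1;
    my $part    = 0;
    my $written = 0;
    my $out;
    while ( my $line = <$in> ) {
        # start a new part on the first line, or when the current
        # part is full (the last part keeps taking lines)
        if ( !$out or ( $written >= $per and $part < $nbFiles ) ) {
            close $out if $out;
            my $name = sprintf $pattern, ++$part;
            open $out, '>', $name
                or croak("cannot open '$name' - $!");
            $written = 0;
        }
        print {$out} $line;
        $written++;
    }
    close $out if $out;
}
```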