josephs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings and salutations! I need to split very large text files into ten equal-sized files — equal not in byte size, but in the number of words in each file. A word is defined as any alphanumeric string followed by a space or punctuation mark. Thanks for any ideas!

Replies are listed 'Best First'.
Re: splitting files by number of words
by moritz (Cardinal) on Aug 05, 2009 at 21:58 UTC
    First go through the file, counting the number of words. Divide that value by ten.

    Then go through the file once again, this time stopping where you need to split, opening a new file, and writing the next words to it.

    It's not rocket science: try it, and if you have problems coding any part of it, come back with more specific questions.
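A minimal sketch of the two-pass approach described above. The file names ('sample.txt', 'chunk.N'), the sample data, and the /\w+/ word regex are placeholder assumptions for this demonstration, not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input  = 'sample.txt';
my $nfiles = 3;            # the OP would use 10

# Create a small sample input just for this demonstration.
open my $make, '>', $input or die $!;
print {$make} "one two three four five six\nseven eight nine\n";
close $make;

# Pass 1: count the words.
my $total = 0;
open my $in, '<', $input or die $!;
$total += () = /\w+/g while <$in>;
close $in;

# Round up so no words are left over after the last file.
my $per_file = int( ( $total + $nfiles - 1 ) / $nfiles );

# Pass 2: write words out, starting a new file every $per_file words.
open $in, '<', $input or die $!;
my ( $count, $index, $out ) = ( 0, 0, undef );
while (<$in>) {
    for my $word (/\w+/g) {
        if ( $count % $per_file == 0 ) {
            close $out if $out;
            open $out, '>', 'chunk.' . $index++ or die $!;
        }
        print {$out} $word, "\n";
        $count++;
    }
}
close $out if $out;
close $in;
```

With nine words and three files, each chunk file ends up with exactly three words.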

Re: splitting files by number of words
by BrowserUk (Patriarch) on Aug 05, 2009 at 22:57 UTC

    If you need the words to remain in the same order, and the file is too big for memory, then you'll need two passes, à la moritz's suggestion.

    However, if it doesn't matter which file each word ends up in, then you can open 10 output files and process the input in one pass, writing to each output file in turn:

    #! perl -slw
    use strict;

    use constant NFILES => 10;

    my @fhs;
    open $fhs[ $_ ], '>', 'output.' . $_ or die $! for 0 .. NFILES - 1;

    my $iFhs = 0;
    while( <> ) {
        for my $word ( split '\W+' ) {
            print { $fhs[ $iFhs ] } $word;
            ++$iFhs;
            $iFhs %= NFILES;
        }
    }

    close $_ for @fhs;

    You might need to change the regex to match your definition of words.
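One detail worth knowing when adjusting that regex: when a line begins with whitespace or punctuation, split /\W+/ produces a leading empty field, which would be printed as an empty "word". Matching the words directly sidesteps that. A small illustration (the sample string is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "  Hello, world: 42!";

# Splitting on non-word runs keeps a leading empty field.
my @by_split = split /\W+/, $line;   # ('', 'Hello', 'world', '42')

# Matching word runs directly returns only the words.
my @by_match = $line =~ /\w+/g;      # ('Hello', 'world', '42')
```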


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: splitting files by number of words
by ww (Archbishop) on Aug 05, 2009 at 22:03 UTC
    In a classic "welcome" to new folks:

    1. What have you tried?

    Please see On asking for help and How do I post a question effectively? (and, not just BTW, because you'll find suggestions there about how to ask a good question, see also Markup in the Monastery).

    2. Another classic (twisted almost beyond recognition in case your answer to 1 is "nothing"): How would you do it with paper and scissors?

    And welcome to PM. You'll really find a lot of help here, but there is a third classic:

    3. This is not a code-a-matic. There's no slot for your nickel/tuppence/Euro. But you will get back much more than you put in, when you show us your effort(s).

    Update: Closed paren. Duh!

Re: splitting files by number of words
by bichonfrise74 (Vicar) on Aug 05, 2009 at 23:43 UTC
    This will get you started. The difference between this and your question is that I'm outputting the result into a hash of arrays, but you can simply change that to write to files instead.
    #!/usr/bin/perl
    use strict;
    use Data::Dumper;

    my (@total_words, %record_file_of);
    my $counter = 0;
    my $num_of_files = 3;

    @total_words = split while (<DATA>);
    my $words_per_file = int( scalar @total_words / $num_of_files );

    for my $i (0 .. $#total_words) {
        $counter++ if ( $i % $words_per_file == 0 );
        push( @{ $record_file_of{$counter} }, $total_words[$i] );
    }

    print Dumper \%record_file_of;

    __DATA__
    This is a test of words. This should be divided into equal files.

      Erm . . .

      @total_words = split while (<DATA>);

      only works because you've got a single line of test data. With more than one line, you'd wind up with only the words of the last line processed. The correct way to do what you're attempting would be along the lines of push @total_words, split; however, you'd then wind up keeping all of the words in memory, which, given the original constraint of "very large files", is probably not going to be viable.
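A minimal demonstration of the difference (the two sample lines are made up): the assignment clobbers the array on every iteration, while push accumulates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ( "one two three", "four five" );

# Buggy: the assignment replaces @total_words on every pass,
# so only the last line's words survive the loop.
my @total_words;
@total_words = split ' ', $_ for @lines;    # ('four', 'five')

# Correct: push appends each line's words.
my @all_words;
push @all_words, split ' ', $_ for @lines;  # ('one' .. 'five')
```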

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

        or you could just use
        $total_words += split while(<DATA>);
        cheers, si_lence
        Thanks, I didn't notice that.