josephs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings and salutations! I need to split very large text files into ten equal-sized files — equal not in byte size, but in the number of words in each file. A word is defined as any alphanumeric string followed by a space or punctuation mark. Thanks for any ideas!

Replies are listed 'Best First'.
Re: splitting files by number of words
by moritz (Cardinal) on Aug 05, 2009 at 21:58 UTC
    First go through the file, counting the number of words. Divide that value by ten.

    Then go through the file once again, this time stopping where you need to split, opening a new file, and writing the next words to it.

    It's not rocket science: try it, and if you have problems coding any part of it, come back with more specific questions.
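A minimal sketch of the two-pass approach described above. The file names ('sample.txt', 'chunk.N'), the sample data, and the /\w+/ word regex are placeholder assumptions for this demonstration, not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input  = 'sample.txt';
my $nfiles = 3;            # the OP would use 10

# Create a small sample input just for this demonstration.
open my $make, '>', $input or die $!;
print {$make} "one two three four five six\nseven eight nine\n";
close $make;

# Pass 1: count the words.
my $total = 0;
open my $in, '<', $input or die $!;
$total += () = /\w+/g while <$in>;
close $in;

# Round up so no words are left over after the last file.
my $per_file = int( ( $total + $nfiles - 1 ) / $nfiles );

# Pass 2: write words out, starting a new file every $per_file words.
open $in, '<', $input or die $!;
my ( $count, $index, $out ) = ( 0, 0, undef );
while (<$in>) {
    for my $word (/\w+/g) {
        if ( $count % $per_file == 0 ) {
            close $out if $out;
            open $out, '>', 'chunk.' . $index++ or die $!;
        }
        print {$out} $word, "\n";
        $count++;
    }
}
close $out if $out;
close $in;
```

With nine words and three files, each chunk file ends up with exactly three words.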

Re: splitting files by number of words
by BrowserUk (Patriarch) on Aug 05, 2009 at 22:57 UTC

    If you need the words to remain in the same order, and the file is too big for memory, then you'll need two passes, à la moritz's suggestion.

    However, if it doesn't matter which file each word ends up in, then you can open 10 output files and process the input in one pass, writing to each output file in turn:

    #! perl -slw
    use strict;

    use constant NFILES => 10;

    my @fhs;
    open $fhs[ $_ ], '>', 'output.' . $_ or die $! for 0 .. NFILES - 1;

    my $iFhs = 0;
    while( <> ) {
        for my $word ( split '\W+' ) {
            print { $fhs[ $iFhs ] } $word;
            ++$iFhs;
            $iFhs %= NFILES;
        }
    }

    close $_ for @fhs;

    You might need to change the regex to match your definition of words.
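One detail worth knowing when adjusting that regex: when a line begins with whitespace or punctuation, split /\W+/ produces a leading empty field, which would be printed as an empty "word". Matching the words directly sidesteps that. A small illustration (the sample string is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "  Hello, world: 42!";

# Splitting on non-word runs keeps a leading empty field.
my @by_split = split /\W+/, $line;   # ('', 'Hello', 'world', '42')

# Matching word runs directly returns only the words.
my @by_match = $line =~ /\w+/g;      # ('Hello', 'world', '42')
```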


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: splitting files by number of words
by ww (Archbishop) on Aug 05, 2009 at 22:03 UTC
    In a classic "welcome" to new folks:

    1. What have you tried?

    Please see On asking for help and How do I post a question effectively? (and, not just BTW, because you'll find suggestions there about how to ask a good question, see also Markup in the Monastery).

    2. Another classic (twisted almost beyond recognition in case your answer to 1 is "nothing"): How would you do it with paper and scissors?

    And welcome to PM. You'll really find a lot of help here, but there is a third classic:

    3. This is not a code-a-matic. There's no slot for your nickel/tuppence/Euro. But you will get back much more than you put in, when you show us your effort(s).

    Update: Closed paren. Duh!

Re: splitting files by number of words
by bichonfrise74 (Vicar) on Aug 05, 2009 at 23:43 UTC
    This will get you started. The difference between this and your question is that I'm outputting the result into a hash of arrays, but you can simply change that to write to files instead.
    #!/usr/bin/perl
    use strict;
    use Data::Dumper;

    my (@total_words, %record_file_of);
    my $counter = 0;
    my $num_of_files = 3;

    @total_words = split while (<DATA>);
    my $words_per_file = int( scalar @total_words / $num_of_files );

    for my $i (0 .. $#total_words) {
        $counter++ if ( $i % $words_per_file == 0 );
        push( @{ $record_file_of{$counter} }, $total_words[$i] );
    }

    print Dumper \%record_file_of;

    __DATA__
    This is a test of words. This should be divided into equal files.

      Erm . . .

      @total_words = split while (<DATA>);

      only works because you've got a single line of test data. With more than one line, you'd wind up with only the words of the last line processed. The correct way to do what you're attempting would be along the lines of push @total_words, split; however, you'd then wind up keeping all of the words in memory, which, given the original constraint of "very large files", is probably not going to be viable.
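A minimal demonstration of the difference (the two sample lines are made up): the assignment clobbers the array on every iteration, while push accumulates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ( "one two three", "four five" );

# Buggy: the assignment replaces @total_words on every pass,
# so only the last line's words survive the loop.
my @total_words;
@total_words = split ' ', $_ for @lines;    # ('four', 'five')

# Correct: push appends each line's words.
my @all_words;
push @all_words, split ' ', $_ for @lines;  # ('one' .. 'five')
```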

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

        or you could just use
        $total_words += split while(<DATA>);
        cheers, si_lence
        Thanks, I didn't notice that.