This is fast and clean, but it creates an unequal distribution of data between files when a file holds only a small number of data objects.
So, let's suppose your input has 27 "data objects", and a particular run is supposed to slice that into 5 parts/files. What would you consider to be the "most equal" distribution over the five output files?
If a distribution like "5, 5, 6, 5, 6" would be okay, then something like this might help:
    use strict;
    use warnings;

    my $filename = "file.name";    # or whatever

    # First pass: count the data objects (each one starts at a line matching /^SS/)
    my $obj_count = 0;
    open( FILE, "<", $filename ) or die "$filename: $!\n";
    while (<FILE>) {
        $obj_count++ if /^SS/;
    }
    close FILE;

    my $part_count   = get_some_number();    # depends on ... (command line? DB?)
    my $obj_per_part = $obj_count / $part_count;    # fractional on purpose
    my $break_at_obj = $obj_per_part;

    # Second pass: copy the lines, switching output files at each cut-off
    open( FILE, "<", $filename ) or die "$filename: $!\n";
    my $o_index = sprintf( "%03d", 1 );
    open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
    my $obj_done = 0;    # objects already written out
    while (<FILE>) {
        if ( /^SS/ ) {
            if ( $obj_done >= $break_at_obj ) {
                close OUT;
                $o_index++;    # string increment keeps the zero padding
                open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
                $break_at_obj += $obj_per_part;
            }
            $obj_done++;
        }
        print OUT;
    }
    close OUT;
    close FILE;

That uses a fractional value for the "objects per output" ("$obj_per_part") and for deciding when the next output file should be opened ("$break_at_obj"). As the number of objects written out is incremented, it reaches the cut-off (becomes greater than or equal to "$break_at_obj") after "n" or "n+1" iterations, where n = int(obj_count/part_count); that is, every output file will contain either "n" or "n+1" objects.

(Update: added "my $filename" to the code so it would pass strictures, but apart from that the code has not been tested. Watch for an "off-by-one" error at the file boundaries: the test uses ">=" while "$obj_done" counts the objects already written; if "$obj_done++" is moved above the test, the comparison needs to be ">" instead.)
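To see which distribution the fractional cut-off would give without touching any files, the same idea can be run on bare counts. This is only a sketch: "part_sizes" is a hypothetical helper name, and it assumes the boundary test is ">=" against the number of objects already written, so that an evenly divisible count does not put an extra object into the first file.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: returns how many objects each
# output file would receive under the fractional cut-off scheme.
sub part_sizes {
    my ( $obj_count, $part_count ) = @_;
    my $obj_per_part = $obj_count / $part_count;    # fractional on purpose
    my $break_at_obj = $obj_per_part;
    my @sizes        = (0);                         # first file starts empty

    # $obj_done is the number of objects already written before this one
    for my $obj_done ( 0 .. $obj_count - 1 ) {
        if ( $obj_done >= $break_at_obj ) {
            push @sizes, 0;                         # "open" the next file
            $break_at_obj += $obj_per_part;
        }
        $sizes[-1]++;
    }
    return @sizes;
}

print join( ", ", part_sizes( 27, 5 ) ), "\n";
```

For 27 objects in 5 parts this prints "6, 5, 6, 5, 5": the same mix of 5s and 6s as "5, 5, 6, 5, 6", just in a different order. Since "$break_at_obj" is built up by repeated floating-point addition, it may be worth trying the counts you actually expect before relying on it.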
In reply to Re: splitting files
by graff
in thread splitting files
by baxy77bax