Regarding your first pseudo-code snippet, you said:

this is fast clean but creates an unequal distribution of data between files for small number of data objects in a file.

So, let's suppose your input has 27 "data objects", and a particular run is supposed to slice that into 5 parts/files. What would you consider to be the "most equal" distribution over the five output files?

If a distribution like "5, 5, 6, 5, 6" would be okay, then something like this might help:

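use strict;
use warnings;

my $filename = "file.name"; # or whatever

# First pass: count the objects (each one starts with a line matching /^SS/).
my $obj_count = 0;
open( FILE, "<", $filename ) or die "$filename: $!\n";
while (<FILE>) {
    $obj_count++ if /^SS/;
}
close FILE;

# Placeholder: where this number comes from depends on ... (command line? DB?)
my $part_count = shift(@ARGV) || 5;
my $obj_per_part = $obj_count / $part_count;  # fractional on purpose
my $break_at_obj = $obj_per_part;

# Second pass: copy the data out, opening a new file at each break point.
open( FILE, "<", $filename ) or die "$filename: $!\n";

my $o_index = sprintf( "%03d", 1 );
open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
my $obj_done = 0;

while (<FILE>) {
    if ( /^SS/ ) {
        # ">=" rather than ">", so an evenly divisible count splits evenly
        if ( $obj_done >= $break_at_obj ) {
            close OUT;
            $o_index++;  # string increment: "001" becomes "002"
            open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
            $break_at_obj += $obj_per_part;
        }
        $obj_done++;
    }
    print OUT;
}
close OUT;
close FILE;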

That uses a fractional value both for the "objects per output" ("$obj_per_part") and for the point at which the next output file should be opened ("$break_at_obj"). As the number of objects written out is incremented, it crosses the current cut-off after either "n" or "n+1" objects, where n = int(obj_count/part_count); that is, every output file will contain either "n" or "n+1" objects. For the example of 27 objects in 5 parts, n is 5 and the cut-offs fall at 5.4, 10.8, 16.2, 21.6 and 27.
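
To check the arithmetic without touching any files, here is a minimal sketch that only simulates the break points and reports how many objects each part would get. The "part_sizes" helper is hypothetical (it is not part of the code above), and it assumes the cut-off test is ">=", which is my guess at where the break should fall.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of objects that would land in
# each output file, using the same fractional break points as above.
sub part_sizes {
    my ( $obj_count, $part_count ) = @_;
    my $obj_per_part = $obj_count / $part_count;   # may be fractional
    my $break_at_obj = $obj_per_part;
    my @sizes = (0);
    for my $obj_done ( 0 .. $obj_count - 1 ) {
        # Assumption: ">=" so that an evenly divisible count splits evenly.
        if ( $obj_done >= $break_at_obj ) {
            push @sizes, 0;                        # start the next part
            $break_at_obj += $obj_per_part;
        }
        $sizes[-1]++;
    }
    return @sizes;
}

print join( ", ", part_sizes( 27, 5 ) ), "\n";   # 6, 5, 6, 5, 5
print join( ", ", part_sizes( 10, 2 ) ), "\n";   # 5, 5
```

For 27 objects in 5 parts that gives parts of 6 and 5 objects only, so the sizes never differ by more than one.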

(Update: added "my $filename" to the code so it would pass strictures, but apart from that the code has not been tested. There might be an "off-by-one" error around the cut-off test, meaning that the "$obj_done++" may need to be placed above the test on its value.)

In reply to Re: splitting files by graff
in thread splitting files by baxy77bax