Regarding your first pseudo-code snippet, you said:
this is fast clean but creates an unequal distribution of data between files for small number of data objects in a file.

So, let's suppose your input has 27 "data objects", and a particular run is supposed to slice that into 5 parts/files. What would you consider to be the "most equal" distribution over the five output files?

If a distribution like "5, 5, 6, 5, 6" would be okay, then something like this might help:

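use strict;

my $filename = "file.name";    # or whatever
my $obj_count = 0;

# First pass: count the objects (each one starts with a line matching /^SS/).
open( FILE, "<", $filename ) or die "$filename: $!\n";
while (<FILE>) {
    $obj_count++ if /^SS/;
}
close FILE;

my $part_count = get_some_number();    # placeholder; depends on ... (command line? DB?)
my $obj_per_part = $obj_count / $part_count;    # deliberately fractional
my $break_at_obj = $obj_per_part;

# Second pass: copy the lines out, switching output files at each cut-off.
open( FILE, "<", $filename ) or die "$filename: $!\n";
my $o_index = sprintf( "%03d", 1 );
open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
my $obj_done = 0;
while (<FILE>) {
    if ( /^SS/ ) {
        # Count this object before testing, so that an object landing
        # exactly on the cut-off stays in the current file.
        $obj_done++;
        if ( $obj_done > $break_at_obj ) {
            close OUT;
            $o_index++;    # string increment: "001" becomes "002"
            open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
            $break_at_obj += $obj_per_part;
        }
    }
    print OUT;
}
close OUT;
close FILE;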
That uses a fractional value for the "objects per output" ("$obj_per_part") and for the cut-off that decides when the next output file should be opened ("$break_at_obj"). As the number of objects written out is incremented, it crosses the cut-off (becomes greater than "$break_at_obj") after "n" or "n+1" objects, where n = int(obj_count/part_count); that is, every output file will contain either "n" or "n+1" objects. With 27 objects and 5 parts, n is 5 and the cut-offs fall at 5.4, 10.8, 16.2, 21.6 and 27.
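
To check what a given pair of counts would produce without touching any files, the cut-off idea can be sketched on its own. This is only an illustration under my own assumptions: "part_sizes" is a hypothetical helper, not something from the original post, and it recomputes each cut-off from the part number rather than accumulating the fraction by repeated addition, which sidesteps any floating-point drift in the running total.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of objects that each output
# file would receive when $obj_count objects are cut into $part_count parts.
sub part_sizes {
    my ( $obj_count, $part_count ) = @_;
    die "part count must be a positive integer\n" if $part_count < 1;

    my @sizes;
    my $prev_cut = 0;
    for my $part ( 1 .. $part_count ) {
        # Cut-off after part k is int( k * N / P ); the last one is always N.
        my $cut = int( $part * $obj_count / $part_count );
        push @sizes, $cut - $prev_cut;
        $prev_cut = $cut;
    }
    return @sizes;
}

print join( ", ", part_sizes( 27, 5 ) ), "\n";    # 5, 5, 6, 5, 6
print join( ", ", part_sizes( 10, 5 ) ), "\n";    # 2, 2, 2, 2, 2
```

The sizes always add up to the object count, and no two parts differ by more than one object.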

(Update: added "my $filename" to the code so it would pass strictures, but apart from that the code has not been tested. Watch for an "off-by-one" error at the file boundaries: the "$obj_done++" has to run before the test on its value; otherwise an evenly divisible input, say 10 objects in 5 parts, comes out as 3, 2, 2, 2, 1.)


In reply to Re: splitting files by graff
in thread splitting files by baxy77bax
