baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I need help, or a suggestion, on this topic. Let's say I have an ASCII file that contains some data, and I need to split this file into several smaller ones so that the integrity of the data in the file remains. Example:

File to be divided:

SS
line a
line a
line a
line a
SS
line b
line b
line b
line b
line b
line b
SS
line c
line c
line c
...
The point is not to have some of the 'line a' records in one file and the rest in another, like:

file 1:
SS
line a
line a

file 2:
line a
line a
The number of parts is dynamic; it changes from time to time. So some obvious algorithms would be:
open (FILE);
my $data = 0;
my $partnumber = 5; # changes constantly
while(<FILE>){ $data++ if ($_ =~ /^SS/) } # to get the number of objects in a file
close FILE;
my $chunk = ($data + $partnumber)/$partnumber; # to ensure there are no remainders
my $index = 1;
open(FILE$index);
open (FILE);
my $i = 0;
while(<FILE>){
    if (m/^SS/){
        $i++;
        if ($i >= $chunk){
            $i = 0;
            close FILE$index;
            $index++;
            open (FILE$index);
        }
    }
    print FILE$index "$_";
}
close FILE & FILE$index;
This is fast and clean, but it creates an unequal distribution of data between files when the number of data objects in a file is small.

The second one would be something like this:

open (FILE);
my $data = 0;
my $partnumber = 5; # changes constantly
while(<FILE>){ $data++ if ($_ =~ /^SS/) } # to get the number of objects in a file
close FILE;
my $chunk = $data/$partnumber;
my $remainder = $data % $partnumber;
my $index = 1;
open(FILE$index);
open (FILE);
my $i = 0;
my $remain = 1;
while(<FILE>){
    if (m/^SS/){
        $i++;
        $chunk += 1 if ($remain < $remainder);
        if ($i >= $chunk){
            $i = 0;
            $remain++;
            close FILE$index;
            $index++;
            open (FILE$index);
        }
    }
    print FILE$index "$_";
}
close FILE & FILE$index;
It resolves the problem of the first one at the cost of an extra evaluation per line (not a big deal, but ...). So my question is: is there a simpler way to do this, or is this the simplest one? And is there a module that does this sort of thing, so I can look at, or even copy, its procedure if it solves the problem faster and more simply? Thank you.

Update:

It was meant to be pseudocode, just to illustrate the dividing algorithms:

chunk = (total + # of chunks) / # of chunks

and

remainder = total % # of chunks
chunk = total / # of chunks
foreach (chunk #){
    if (remainder < # of chunks){
        add one to ensure that all data is divided between files
    }
}

So, as you can see, the problem is how to divide the data elegantly between files, ensuring that no object is split across files and that all the data ends up in some file.
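The counting rule above could be sketched like this (illustrative only, not the real splitting code; the sub name is made up): compute the base chunk size with integer division and give the first `total % parts` chunks one extra object.

```perl
use strict;
use warnings;

# Split $total objects into $parts chunk sizes: every chunk gets
# int($total/$parts) objects, and the first ($total % $parts) chunks
# each get one extra, so the sizes always sum to $total.
sub chunk_sizes {
    my ($total, $parts) = @_;
    my $base      = int($total / $parts);
    my $remainder = $total % $parts;
    return map { $base + ($_ < $remainder ? 1 : 0) } 0 .. $parts - 1;
}

my @sizes = chunk_sizes(27, 5);
print "@sizes\n";   # prints "6 6 5 5 5"
```

The sizes differ by at most one, which avoids the skew of the ceiling-division version.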

Thanks

Re: splitting files
by jwkrahn (Abbot) on Mar 14, 2009 at 22:10 UTC

    Perhaps you want something like this:

    my $index = 1;
    while ( <> ) {
        if ( /^SS/ ) {
            open OUT, '>', "$ARGV$index" or die "Cannot open '$ARGV$index' $!";
            ++$index;
        }
        print OUT;
    }
Re: splitting files
by codeacrobat (Chaplain) on Mar 14, 2009 at 22:06 UTC
    perl -pe '/^SS/ and (open STDOUT, ">", $ARGV . $count++ or die)' thebigfile.txt
    should split nicely to thebigfile.txt0, thebigfile.txt1...

Re: splitting files
by ELISHEVA (Prior) on Mar 15, 2009 at 06:38 UTC

    Did you try to run code using either of these algorithms, or was your posted code meant as pseudo-code?

    • FILE$index generates a syntax error. If you are opening files sequentially it is also unnecessary. You need to change the name of the file associated with the file handle, not the name of the file handle. See open for details.
    • Did you intend to be doing integer division when you calculated chunk size? If so, $data/$partnumber needs to be int($data/$partnumber). As it stands $chunk is a fraction.
    • The second algorithm does not allocate lines as I think you were intending: For data with 3 SS lines all data ends up in a single file. If you go up to 5 SS lines, your first file has 0 objects and all remaining files have only one object. If you go up to 6 or more SS lines objects get allocated a bit more evenly, but not necessarily total lines. For example, if you have a run of small objects that add up to $chunk-1 followed by one super big one they could all end up in the same file.
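    For example, a minimal sketch of the lexical-filehandle point above (illustrative only: in-memory string handles stand in for real part files, and the sub name is invented). The key is that one handle variable is reused while the thing it is opened on changes:

```perl
use strict;
use warnings;

# Split an array of lines into parts of $per_part objects each, reusing
# a single lexical filehandle and changing only what it is opened on.
sub split_objects {
    my ($lines, $per_part) = @_;
    my (@parts, $buf, $out);
    open $out, '>', \$buf or die $!;   # in-memory file, for illustration
    my $seen = 0;                      # objects written to current part
    for my $line (@$lines) {
        if ($line =~ /^SS/) {
            if ($seen >= $per_part) {
                close $out;
                push @parts, $buf;
                open $out, '>', \$buf or die $!;   # '>' truncates the buffer
                $seen = 0;
            }
            $seen++;
        }
        print {$out} $line;
    }
    close $out;
    push @parts, $buf;
    return @parts;
}

my @parts = split_objects([ map "$_\n", 'SS', 'a', 'SS', 'b', 'SS', 'c' ], 2);
print scalar(@parts), " parts\n";   # 3 objects, 2 per part -> prints "2 parts"
```

    With real files you would open the handle on "$filename.$index" instead of a scalar reference; the structure is the same.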

    If it is important to even out the total number of lines in each file as much as possible, then you might want to read up on optimization algorithms, particularly partitioning algorithms. There is no simple way to do this. A similar problem was discussed just a few days ago (see partition of an array). Although that problem discusses partitioning an array in 2, the goal is the same: evening out the sums among N buckets. In your case you are summing lines associated with objects rather than numbers in an array, but the basic problem is the same. As you read through that thread, pay particular attention to the dialog between Limbic~Region and BrowserUK, and also the back and forth between sundialsvc4 and ikegami.
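    To make the idea concrete, here is a rough greedy heuristic (a sketch of the general technique, not an optimal partitioner, and not code from that thread; all names are invented): given the line count of each SS object, assign each object, largest first, to the bucket with the smallest running total.

```perl
use strict;
use warnings;

# Greedy "largest first into least-loaded bucket" partitioning.
# $counts is an arrayref of per-object line counts; returns the object
# indices per bucket and the resulting line totals per bucket.
sub greedy_partition {
    my ($counts, $nbuckets) = @_;
    my @totals  = (0) x $nbuckets;
    my @buckets = map { [] } 1 .. $nbuckets;
    for my $i (sort { $counts->[$b] <=> $counts->[$a] } 0 .. $#$counts) {
        # pick the bucket with the smallest total so far
        my ($min) = sort { $totals[$a] <=> $totals[$b] } 0 .. $nbuckets - 1;
        push @{ $buckets[$min] }, $i;
        $totals[$min] += $counts->[$i];
    }
    return (\@buckets, \@totals);
}

my ($buckets, $totals) = greedy_partition([9, 7, 6, 5, 5, 4], 3);
print "@$totals\n";   # prints "13 12 11" -- close to the ideal 12 each
```

    This does not guarantee the optimal split, but it keeps the bucket totals within one large object of each other, which is often good enough for file splitting.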

    Best, beth

Re: splitting files
by graff (Chancellor) on Mar 15, 2009 at 22:40 UTC
    Regarding your first pseudo-code snippet, you said:
    This is fast and clean, but it creates an unequal distribution of data between files when the number of data objects in a file is small.

    So, let's suppose your input has 27 "data objects", and a particular run is supposed to slice that into 5 parts/files. What would you consider to be the "most equal" distribution over the five output files?

    If a distribution like "5, 5, 6, 5, 6" would be okay, then something like this might help:

    use strict;

    my $filename = "file.name"; # or whatever
    my $obj_count = 0;
    open( FILE, "<", $filename ) or die "$filename: $!\n";
    while (<FILE>) {
        $obj_count++ if /^SS/;
    }
    close FILE;

    my $part_count = get_some_number(); # depends on ... (command line? DB?)
    my $obj_per_part = $obj_count / $part_count;
    my $break_at_obj = $obj_per_part;

    open( FILE, "<", $filename );
    my $o_index = sprintf( "%03d", 1 );
    open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
    my $obj_done = 0;
    while (<FILE>) {
        if ( /^SS/ ) {
            if ( $obj_done > $break_at_obj ) {
                close OUT;
                $o_index++;
                open( OUT, ">", "$filename.$o_index" ) or die "$filename.$o_index: $!\n";
                $break_at_obj += $obj_per_part;
            }
            $obj_done++;
        }
        print OUT;
    }
    That uses a fractional value for the "objects per output" and for deciding when the next output file should be opened ("break_at_obj"); as the number of objects written out is incremented, it will cross the cut-off (be greater than "break_at_obj") at "n" or "n+1" iterations, where n=int(obj_count/part_count) -- that is, every output file will contain either "n" or "n+1" objects.
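    As a quick sanity check of that claim, one can simulate the same break-at rule without touching any files (this is my own simulation of the rule, not the code above run against real data):

```perl
use strict;
use warnings;

# Simulate splitting 27 objects into 5 parts with fractional break
# points: a new part starts when the number of completed objects
# exceeds the current break point, which then advances by 27/5 = 5.4.
my ($obj_count, $part_count) = (27, 5);
my $obj_per_part = $obj_count / $part_count;   # fractional on purpose
my $break_at     = $obj_per_part;
my @per_file     = (0);
my $done         = 0;
for (1 .. $obj_count) {
    if ($done > $break_at) {
        push @per_file, 0;                     # start the next part
        $break_at += $obj_per_part;
    }
    $done++;
    $per_file[-1]++;
}
print "@per_file\n";   # prints "6 5 6 5 5" -- every part has 5 or 6 objects
```

    Every count comes out as either int(27/5) = 5 or 5+1 = 6, as described.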

    (Update: added "my $filename" to the code so it would pass strictures, but apart from that the code has not been tested. There might be an "off-by-one" error, meaning that the "$obj_done++" may need to be placed above the test on its value.)