in reply to Split a file based on column
All of the above answers seem to have problems with possible filehandle limits; personally I would read the entire file, convert it to a hash of arrays, and then write each array out to a file named after the array key. This has the advantage that only one file is open at any time. I will stick my neck out and say it will also be faster, due to less file I/O.
As a second comment, you should use something like Text::CSV to parse the data, but if you want it quick and dirty there's a good argument for using split rather than a full regex match here.
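For completeness, a minimal sketch of the Text::CSV route (assuming the CPAN module is installed; the `sep_char` setting is what makes it handle pipe-delimited rather than comma-delimited data):

```perl
use strict;
use warnings;
use Text::CSV;

# sep_char => '|' tells the parser the fields are pipe-delimited;
# binary => 1 lets it cope with embedded non-ASCII bytes
my $csv = Text::CSV->new({ sep_char => '|', binary => 1 })
    or die Text::CSV->error_diag;

my $line = "foo|key1|some data";
$csv->parse($line) or die $csv->error_diag;
my @fields = $csv->fields;   # ('foo', 'key1', 'some data')
```

Unlike a bare split, Text::CSV copes correctly with quoted fields that contain the delimiter, which is the usual reason to prefer it.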
Amount of data: 300k rows at 64KB per row is approx 19.6GB, which may cause memory problems, so a compromise is to write the data out whenever a certain number of rows have accumulated.
The following (untested) code shows the idea; it assumes you specify the file(s) you want to read from on the command line.
Update: Changed when it writes to file as a result of a davido comment
```perl
use strict;
use warnings;

use constant ROW_LIMIT => 10000;

sub writeData {
    my ($name, $data) = @_;
    open my $fh, '>>', "sample_$name" or die "Cannot open sample_$name: $!";
    print $fh @$data;
    close $fh;
}

my %hash;
my $ctr = 0;
while (<>) {
    my @elems = split /\|/;          # '|' must be escaped in a regex
    my $idx   = $elems[1];
    if (exists $hash{$idx}) {
        push @{ $hash{$idx} }, $_;   # save to existing array
    }
    else {
        $hash{$idx} = [ $_ ];        # create new array (arrayref, not a list)
    }
    # if we've got too much data, write it out and free the memory
    if ($ctr++ >= ROW_LIMIT) {
        foreach my $key (keys %hash) {
            writeData($key, $hash{$key});
            delete $hash{$key};
        }
        $ctr = 0;
    }
}

# write remaining data to each file...
foreach my $key (keys %hash) {
    writeData($key, $hash{$key});
}
```
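One pitfall worth calling out from the code above: the delimiter must be escaped in the split pattern. An unescaped `|` is regex alternation between two empty patterns, which matches the empty string at every position and splits the line into individual characters:

```perl
use strict;
use warnings;

my @bad  = split /|/,  "a|b|c";   # empty-pattern match: ('a', '|', 'b', '|', 'c')
my @good = split /\|/, "a|b|c";   # what was intended:   ('a', 'b', 'c')

printf "bad: %d fields, good: %d fields\n", scalar @bad, scalar @good;
```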
Replies: Re^2: Split a file based on column
- by Anonymous Monk on Jan 17, 2013 at 10:59 UTC
- by space_monk (Chaplain) on Jan 17, 2013 at 11:04 UTC
- by davido (Cardinal) on Jan 17, 2013 at 18:56 UTC