in reply to Split a file based on column
All of the above answers seem to have problems with possible filehandle limits; personally I would read the entire file, convert it to a hash of arrays, and then write each array out to a file named after the array key. This has the advantage that only one file is open at any time. I will stick my neck out and say it will also be faster, due to less file I/O.
As a second comment, you should use something like Text::CSV to parse the data, but if you want it quick and dirty there's a good argument for using split rather than a full regex match here.
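For completeness, a minimal sketch of the Text::CSV route (assuming the CPAN module is installed; the `sep_char` setting is what makes it handle pipe-delimited rather than comma-delimited data):

```perl
use strict;
use warnings;
use Text::CSV;

# sep_char => '|' tells the parser the fields are pipe-delimited;
# binary => 1 lets it cope with embedded non-ASCII bytes
my $csv = Text::CSV->new({ sep_char => '|', binary => 1 })
    or die Text::CSV->error_diag;

my $line = "foo|key1|some data";
$csv->parse($line) or die $csv->error_diag;
my @fields = $csv->fields;   # ('foo', 'key1', 'some data')
```

Unlike a bare split, Text::CSV copes correctly with quoted fields that contain the delimiter, which is the usual reason to prefer it.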
Amount of data: 300k rows at 64KB per row is approx 19.6GB, which may cause memory problems, so a compromise is to write the data out whenever a certain number of rows have accumulated.
The following (untested) code shows the idea; it assumes you specify the file(s) you want to read from on the command line.
Update: Changed when it writes to file as a result of a davido comment
```perl
use strict;
use warnings;

use constant ROW_LIMIT => 10000;

sub writeData {
    my ($name, $data) = @_;
    open my $fh, '>>', "sample_$name" or die "Cannot open sample_$name: $!";
    print $fh @$data;
    close $fh;
}

my %hash;
my $ctr = 0;
while (<>) {
    my @elems = split /\|/;          # '|' must be escaped in a regex
    my $idx   = $elems[1];
    if (exists $hash{$idx}) {
        push @{ $hash{$idx} }, $_;   # save to existing array
    }
    else {
        $hash{$idx} = [ $_ ];        # create new array (arrayref, not a list)
    }
    # if we've got too much data, write it out and free the memory
    if ($ctr++ >= ROW_LIMIT) {
        foreach my $key (keys %hash) {
            writeData($key, $hash{$key});
            delete $hash{$key};
        }
        $ctr = 0;
    }
}

# write remaining data to each file...
foreach my $key (keys %hash) {
    writeData($key, $hash{$key});
}
```
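One pitfall worth calling out from the code above: the delimiter must be escaped in the split pattern. An unescaped `|` is regex alternation between two empty patterns, which matches the empty string at every position and splits the line into individual characters:

```perl
use strict;
use warnings;

my @bad  = split /|/,  "a|b|c";   # empty-pattern match: ('a', '|', 'b', '|', 'c')
my @good = split /\|/, "a|b|c";   # what was intended:   ('a', 'b', 'c')

printf "bad: %d fields, good: %d fields\n", scalar @bad, scalar @good;
```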
Replies: Re^2: Split a file based on column
- by Anonymous Monk on Jan 17, 2013 at 10:59 UTC
- by space_monk (Chaplain) on Jan 17, 2013 at 11:04 UTC
- by davido (Cardinal) on Jan 17, 2013 at 18:56 UTC