in reply to part - split up files according to column value
The following writes out each line as it is read in, just like the awk version. The cost is having a potentially very large number of file handles open at once — one per unique value seen in the given column. In fact, you might very easily run into your system's open filehandle limit. :-)
This solution avoids a potential problem with file naming. Rather than name the output files with the actual value seen in the field, it uses its own one-up scheme. At the end of processing the input, it prints out a table mapping filenames to column values.
It also properly handles the cases where the given column has no content (empty string) and where it does not exist in the row at all (undef). Output file number zero is reserved for the latter case.
    # config:
    my $field = 0;
    my $sep   = "\t";

    $, = $sep;    # output field separator
    $\ = $/;      # output record separator

    my %file;     # key => { num, name, fh }
    my $fnum = 1;

    while (<>) {
        chomp;
        my @c   = split /$sep/o;
        my $key = defined $c[$field] ? $c[$field] : '(column not present)';
        unless ( $file{$key} ) {
            # take the next number only when a new key is seen, so the
            # numbering has no gaps; 0 is reserved for rows lacking the column
            $file{$key}{num}  = defined $c[$field] ? $fnum++ : 0;
            $file{$key}{name} = sprintf 'part.%03d', $file{$key}{num};
            -f $file{$key}{name}
                and die "Sorry, '$file{$key}{name}' exists; won't clobber.";
            open $file{$key}{fh}, ">", $file{$key}{name}
                or die "Error opening '$file{$key}{name}' for write - $!";
        }
        print {$file{$key}{fh}} @c;
    }

    # table mapping filenames to column values
    print $file{$_}{name}, $_
        for sort { $file{$a}{num} <=> $file{$b}{num} } keys %file;
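The empty-versus-missing distinction rests on how split treats each row, so here is a small sketch of it. The helper name classify is made up for illustration and is not part of the script above. One thing worth knowing: without a limit argument, split drops trailing empty fields, so an empty last column comes back as undef rather than as an empty string.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): reports how the
# column at index $field looks after splitting $line on tabs.
sub classify {
    my ( $line, $field ) = @_;
    my @c = split /\t/, $line;
    return 'missing' unless defined $c[$field];
    return $c[$field] eq '' ? 'empty' : 'value';
}

print classify( "a\t\tc", 1 ), "\n";    # empty:   defined but zero-length
print classify( "a",      1 ), "\n";    # missing: the row has only one column
print classify( "a\t",    1 ), "\n";    # missing: trailing empty field is dropped
```

If a trailing empty column should count as empty rather than missing, passing a negative limit, as in split /\t/, $line, -1, keeps those fields.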
Update: Corion has suggested FileCache as a way to circumvent the open filehandle limit.
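For what it's worth, here is a rough sketch of how that might look. It is not Corion's code, just my guess at one way to wire it in. The helper name split_by_column, the $dir parameter, and the maxopen value of 16 are all assumptions made for illustration. FileCache's cacheout, given only a path, opens the file with '>' the first time and '>>' whenever it has to reopen it, and it hands back the path itself as a symbolic filehandle, which is why strict refs is switched off around the print.

```perl
use strict;
use warnings;
use FileCache maxopen => 16;    # assumed cap; tune it to your system's limit

# Hypothetical helper: splits the rows read from $in into files under $dir,
# one file per distinct value in column $field. Returns a hashref mapping
# column value => file path. Meant as a one-shot sketch, not a library.
sub split_by_column {
    my ( $in, $field, $sep, $dir ) = @_;
    my %name;
    my $fnum = 1;

    while ( my $line = <$in> ) {
        chomp $line;
        my @c   = split /\Q$sep\E/, $line;
        my $key = defined $c[$field] ? $c[$field] : '(column not present)';

        unless ( exists $name{$key} ) {
            # 0 is reserved for rows that lack the column entirely
            my $num  = defined $c[$field] ? $fnum++ : 0;
            my $path = sprintf '%s/part.%03d', $dir, $num;
            die "Sorry, '$path' exists; won't clobber.\n" if -e $path;
            $name{$key} = $path;
        }

        # FileCache closes and reopens handles behind the scenes as needed
        my $fh = cacheout( $name{$key} );
        no strict 'refs';       # $fh is the path, used as a symbolic handle
        print {$fh} join( $sep, @c ), "\n";
    }

    # close whatever is still open so the output is flushed
    {
        no strict 'refs';
        close $_ for values %name;
    }
    return \%name;
}
```

Calling split_by_column( \*STDIN, 0, "\t", '.' ) should behave much like the script above, with at most maxopen files open at any moment. The trade-off is extra open and close calls whenever the input jumps between many different keys.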