part - split up files according to column value

Category:	Utility Scripts
Author/Contact Info	Corion
Description:	I often have to split up text files for consumption in Excel. A convenient way of splitting up a file meaningfully is to split it up by the value of a column. The program does this by accumulating the input into a hash of arrays keyed by the column value. There is an `awk` oneliner (I'm told) that circumvents the memory limitations this program encounters: `awk -F '{ print $0 > $3 }' FILES` [download] If you want something like this program in a module, see List::Part Update: Also see join - join two files according to a common key. If you need one, you'll likely need the other too. Update: v0.03 now can part according to more than one column. Update: v0.04 now can output multiple header lines into every file. Update: v0.06 fixes two errors: The column at the end of each line couldn't be used (well) as the key column. Header lines are now actually printed. Update: Now there also is a Github repository for the case that you want to submit a patch. Update: The code is now also on CPAN as App::part.
#!/usr/bin/perl -w use strict; use Getopt::Long; use vars qw($VERSION); $VERSION = '0.06'; # Try to load Pod::Usage and install a fallback if it doesn't exist eval { require Pod::Usage; Pod::Usage->import(); 1; } or do { *pod2usage = sub { die "Error in command line.\n"; }; }; =head1 NAME part - split up a single input file into multiple files according to a + column value =head1 SYNOPSIS part FILES =head1 OPTIONS =item B<--out> - set the output template If the output template is not given it is guessed from the name of the first input file or set to C<part-%s.txt>. The C<%s> will be replaced by the column value. =item B<--column> - set the column to part on This is the zero-based number of the column. Multiple columns may be given. =item B<--separator> - set the column separator This is the separator for the columns. It defaults to a tab character ("\t"). =item B<--header-line> - output the first line into every file This defines the line as header line which is output into every file. If it is given an argument that string is output as header, otherwise the first line read will be repeated as the header. If the value is a number, that many lines will be read from the file and used as the header. This makes it impossible to use just a number as the header. =item B<--verbose> - output the generated filenames In normal operation, the program will be silent. If you need to know the generated filenames, the C<--verbose> option will output them. =item B<--filename-sep> - set the separator for the filenames If you prefer a different separator for the filenames than a newline, this option allows you to set it. If the separator looks like an octal number (three digits) it is interpreted as such. Otherwise it will be taken literally. A common use is to set the separator to C<000> to separate the files by the zero character if you suspect that your filenames might contain newlines. It defaults to C<012>, a newline. =item B<--version> - output version information =head1 CAVEAT The program loads the whole input into RAM before writing the output. A future enhancement might be a C<uniq>-like option that tells the program to assume that the input will be grouped according to the parted column so it does not need to allocate memory. If your memory is not large enough, the following C<awk> one-liner might help you: # Example of parting on column 3 awk -F '{ print $0 > $3 }' FILE =head1 AUTHOR Copyright (c) 2007 Max Maischein (C<< corion@cpan.org >>) =cut GetOptions( 'out=s' => \my $tmpl, 'column=i' => \my @col, 'separator=s' => \my $sep, 'verbose' => \my $verbose, 'filename-sep=s' => \my $filename_sep, 'header-line:s' => \my $header, 'help' => \my $help, 'version' => \my $version, ) or pod2usage(2); pod2usage(1) if $help; if (defined $version) { print "$VERSION\n"; exit 0; }; pod2usage("$0: No files given.") if ((@ARGV == 0) && (-t STDIN)); if (! defined $tmpl) { # Let's hope we can guess from the first filename my $placeholder = '-%s' x @col; ($tmpl = $ARGV[0] \|\| 'part.txt') =~ s/\.(\w+)$/$placeholder.$1/; }; if (! defined $sep) { $sep = "\t"; }; $filename_sep \|\|= "012"; if ($filename_sep =~ /^\d{3}$/) { $filename_sep = chr oct $filename_sep }; my %lines; if (defined $header) { $header \|\|= 1; if ($header =~ /^\d+$/) { my $count = $header; $header = ""; $header .= <> while $count--; }; }; while (<>) { s/\r?\n$//; my @c = split /$sep/o; my $key = join $sep, @c[ @col ]; if (not defined $lines{ $key }) { $lines{ $key } \|\|= []; }; push @{ $lines{$key}}, $_ } for my $key (sort keys %lines) { my @vals = split /$sep/o, $key; my $name = sprintf $tmpl, @vals; open my $fh, ">", $name or die "Couldn't create '$name': $!"; if ($header) { print {$fh} $header; } print "$name$filename_sep" if $verbose; print {$fh} "$_\n" for (@{ $lines{ $key }}); };

Comment on part - split up files according to column value Download Code

Replies are listed 'Best First'.
Re: part - split up files according to column value by jdporter (Paladin) on Feb 12, 2007 at 17:53 UTC
The following writes out each line as it is read in, just like the awk version. The cost is having a potentially very large number of file handles open at once — one per unique value seen in the given column. In fact, you might very easily run into your system's open filehandle limit. :-) This solution avoids a potential problem with file naming. Rather than name the output files with the actual value seen in the field, it uses its own one-up scheme. At the end of processing the input, it prints out a table mapping filenames to column values. It also properly handles the cases where the given column has no content (empty string) and does not exist in the row at all (undef). The output file number zero is reserved for the latter case. # config: my $field = 0; my $sep = "\t"; $, = $sep; $\ = $/; my %file; # { num, name, $fh } my $fnum = 1; while (<>) { chomp; my @c = split /$sep/o; my( $key, $num ) = defined $c[$field] ? ( $c[$field], $fnum++ ) : ( '(column not present)', 0 ); unless ( $file{$key} ) { $file{$key}{num} = $num; $file{$key}{name} = sprintf 'part.%03d', $file{$key}{num}; -f $file{$key}{name} and die "Sorry, '$file{$key}{name}' exists; won't clobber."; open $file{$key}{fh}, ">", $file{$key}{name} or die "Error opening '$file{$key}{name}' for write - $!"; } print {$file{$key}{fh}} @c; } print $file{$_}{name}, $_ for sort { $file{$a}{num} <=> $file{$b}{num} } keys %file; [download] Update: Corion has suggested FileCache as a way to circumvent the open filehandle limit. A word spoken in Mind will reach its own level, in the objective world, by its own weight	[reply] [d/l]
Re^2: part - split up files according to column value by mick2020 (Novice) on Aug 26, 2008 at 10:05 UTC
Hi, Could you give an example of how to run this. I am new to perl. Also how would you incorporate the Filecache in this example. I want to split a file based on the first column and save in file with the name as the name in the first column field without the quotations. ex. data: `"1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "2", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "4", "This" , "is" , "test", "data" "2", "This" , "is" , "test", "data" "3", "This" , "is" , "test", "data"` [download] would create four files named 1,2,3,4 with the data in it. `file 1: "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" "1", "This" , "is" , "test", "data" file 2: "2", "This" , "is" , "test", "data" "2", "This" , "is" , "test", "data" file 3: "3", "This" , "is" , "test", "data" file 4: "4", "This" , "is" , "test", "data"` [download] It is large file so I need to use the Filecache Thanks For any help	[reply] [d/l] [select]
Re^3: part - split up files according to column value by Corion (Patriarch) on Aug 26, 2008 at 13:31 UTC
You can start by telling us where you encounter problems and what difficulties you have incorporating FileCache into jdporter's code.	[reply]
Re^4: part - split up files according to column value by mick2020 (Novice) on Aug 26, 2008 at 14:37 UTC
Re^5: part - split up files according to column value by Corion (Patriarch) on Aug 26, 2008 at 14:46 UTC
Re^5: part - split up files according to column value by mick2020 (Novice) on Aug 26, 2008 at 15:45 UTC
Some notes below your chosen depth have not been shown here
Re^4: part - split up files according to column value by mick2020 (Novice) on Aug 26, 2008 at 15:55 UTC
Re^2: part - split up files according to column value by Anonymous Monk on Feb 08, 2011 at 11:57 UTC
I have just found this piece of code and it is perfect for a regular task I have. However I am struggling to get headers to work. If I use the header-line switch it ignores the headline for file creation but it does not seem to paste the header in to the top of each file it is creating. Any advice?	[reply]
Re^3: part - split up files according to column value by Corion (Patriarch) on Feb 08, 2011 at 12:05 UTC
jdporter's program has no option for headers. Did you mean to reply to my program? If so, please do consider telling me how you invoke the program, and what the layout of the first few rows of your input file is and what you get for output, so that I can reproduce the problem.	[reply]
Re^4: part - split up files according to column value by Anonymous Monk on Feb 08, 2011 at 12:25 UTC
Re^5: part - split up files according to column value by Anonymous Monk on Feb 08, 2011 at 12:27 UTC
Some notes below your chosen depth have not been shown here
Re: part - split up files according to column value by Corion (Patriarch) on Jul 14, 2009 at 12:30 UTC
As I constantly mislay the version of the program that reads and writes Excel files, here it is: #!/usr/bin/perl -w use strict; use Getopt::Long; use vars qw($VERSION); use Spreadsheet::WriteExcel; use Spreadsheet::ParseExcel; use File::Glob qw(bsd_glob); use ExcelTools; $VERSION = '0.05'; # Try to load Pod::Usage and install a fallback if it doesn't exist eval { require Pod::Usage; Pod::Usage->import(); 1; } or do { *pod2usage = sub { die "Error in command line.\n"; }; }; GetOptions( 'out=s' => \my $tmpl, 'column=i' => \my $col, 'verbose' => \my $verbose, 'header-line:s' => \my $header, 'help' => \my $help, 'version' => \my $version, ) or pod2usage(2); pod2usage(1) if $help; if (defined $version) { print "$VERSION\n"; exit 0; }; pod2usage("$0: No files given.") if ((@ARGV == 0) && (-t STDIN)); if (! defined $tmpl) { # Let's hope we can guess from the first filename ($tmpl = $ARGV[0] \|\| 'part.xls') =~ s/\.(\w+)$/-%s.$1/; }; $col \|\|= 0; $header \|\|= 0; my $header_cols; @ARGV = map { bsd_glob $_ } @ARGV; my %lines; for my $file (@ARGV) { my $wb = Spreadsheet::ParseExcel::Workbook->Parse($file); my $data = ExcelTools::sheet_data($wb->{Worksheet}->[0]); $header_cols = splice @$data, 0, $header; for my $c (@$data) { $lines{ $c->[$col]} \|\|= []; push @{ $lines{$c->[$col]} }, $c }; }; for my $key (sort keys %lines) { (my $clean = $key) =~ s/\s+$//ms; my $name = sprintf $tmpl, $clean; my $out = Spreadsheet::WriteExcel->new($name); my $sheet = $out->add_worksheet('Kandidaten'); unshift @{$lines{ $key }}, $header_cols if $header; $sheet->write_col( 'A1', $lines{ $key }); print "$name\n" #$filename_sep" if $verbose; }; __END__ =head1 NAME xlpart - split up an Excel file into multiple files according to a col +umn value =head1 SYNOPSIS part [OPTIONS] FILES =head1 OPTIONS =item B<--version> - print program version Outputs the program version. =item B<--help> - print this page Outputs this help text. =item B<--out> - set the output template If the output template is not given it is guessed from the name of the first input file or set to C<part-%s.txt>. The C<%s> will be replaced by the column value. =item B<--column> - set the column to part on This is the zero-based number of the column. =item B<--header-line> - output the first line into every file This defines the line as header line which is output into every file. If it is given an argument that string is output as header, otherwise the first line read will be repeated as the header. If the value is a number, that many lines will be read from the file and used as the header. This makes it impossible to use just a number as the header. =item B<--verbose> - output the generated filenames In normal operation, the program will be silent. If you need to know the generated filenames, the B<--verbose> option will output them. =head1 CAVEAT The program loads the whole input into RAM before writing the output. A future enhancement might be a C<uniq>-like option that tells the program to assume that the input will be grouped according to the parted column so it does not need to allocate memory. If your memory is not large enough, the following C<awk> one-liner might help you: # Example of parting on column 3 awk -F '{ print $0 > $3 }' FILE =head1 AUTHOR Max Maischein (C<< corion@cpan.org >>) [download]	[reply] [d/l]
Re: part - split up files according to column value by toolic (Bishop) on Feb 09, 2011 at 15:50 UTC
There are a couple minor POD errors in version 6: `perldoc part ... POD ERRORS Hey! The above document had some coding errors, which are expla +ined below: Around line 29: '=item' outside of any '=over' Around line 77: You forgot a '=back' before '=head1'` [download] Here is a patch: `--- part.6 2011-02-09 10:44:22.000000000 -0500 +++ part.fix 2011-02-09 10:45:32.000000000 -0500 @@ -26,6 +26,8 @@ =head1 OPTIONS +=over + =item B<--out> - set the output template If the output template is not given it is guessed from @@ -74,6 +76,8 @@ =item B<--version> - output version information +=back + =head1 CAVEAT The program loads the whole input into RAM` [download] Also, it would be helpful if you could add a few example command lines in your POD, like the one you showed in Re^6: part - split up files according to column value.	[reply] [d/l] [select]
Re^2: part - split up files according to column value by sl7020 (Initiate) on Jul 10, 2012 at 21:26 UTC
Monks: Trying to modify script to give me a subtotal of column 14 (see below.) `while (<>) { s/\r?\n$//; my @c = split /$sep/o; my $sub_total = 0; $sub_total += $c[14]; my $key = join $sep, @c[ @col ]; if (not defined $lines{ $key }) { $lines{ $key } \|\|= []; }; push @{ $lines{$key}}, $_ push @{ $totals{$key}}, $sub_total }` [download] My code is failing with the following: "my" variable %lines masks earlier declaration in same scope at ./part-v2.pl line 179. "my" variable $key masks earlier declaration in same statement at ./part-v2.pl line 180. syntax error at ./part-v2.pl line 176, near "$_ push" syntax error at ./part-v2.pl line 176, near "}}" Global symbol "$sub_total" requires explicit package name at ./part-v2.pl line 177. syntax error at ./part-v2.pl line 191, near "}" Execution of ./part-v2.pl aborted due to compilation errors. ======================== Any help would be appreciated.	[reply] [d/l]
Re^3: part - split up files according to column value by aaron_baugher (Curate) on Jul 10, 2012 at 22:39 UTC
Statements end with semicolons. Blocks do not: `} push @{ $lines{$key}}, $_; push @{ $totals{$key}}, $sub_total; }` [download] You also have a problem in that you instantiate `$sub_total` with a value of zero every time through the loop, right before you add a value to it, so it will always equal that value. This is probably not what you want. You should move the `my` line that creates it to before your loop, so it can accumulate inside the loop. Aaron B. Available for small or large Perl jobs; see my home node.	[reply] [d/l] [select]


"be consistent"
	PerlMonks