in reply to part - split up files according to column value

The following writes out each line as it is read in, just like the awk version. The cost is having a potentially very large number of file handles open at once — one per unique value seen in the given column. In fact, you might very easily run into your system's open filehandle limit. :-)

This solution avoids a potential problem with file naming. Rather than name the output files with the actual value seen in the field, it uses its own one-up scheme. At the end of processing the input, it prints out a table mapping filenames to column values.

It also properly handles two edge cases: the given column has no content (empty string), or it does not exist in the row at all (undef). Output file number zero is reserved for the latter case.

    # config:
    my $field = 0;
    my $sep = "\t";

    $, = $sep;
    $\ = $/;

    my %file; # { num, name, $fh }
    my $fnum = 1;

    while (<>)
    {
        chomp;
        my @c = split /$sep/o;
        my( $key, $num ) = defined $c[$field]
            ? ( $c[$field], $fnum++ )
            : ( '(column not present)', 0 );
        unless ( $file{$key} )
        {
            $file{$key}{num} = $num;
            $file{$key}{name} = sprintf 'part.%03d', $file{$key}{num};
            -f $file{$key}{name} and die "Sorry, '$file{$key}{name}' exists; won't clobber.";
            open $file{$key}{fh}, ">", $file{$key}{name}
                or die "Error opening '$file{$key}{name}' for write - $!";
        }
        print {$file{$key}{fh}} @c;
    }

    print $file{$_}{name}, $_
        for sort { $file{$a}{num} <=> $file{$b}{num} } keys %file;
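
To see the empty-versus-missing distinction in isolation, here is a small sketch. The name column_state is a hypothetical helper introduced only for illustration; it is not part of the program above, and it assumes a tab separator.

```perl
use strict;
use warnings;

# column_state() is a hypothetical helper, for illustration only.
# It reports whether column $field of a tab-separated line is
# missing (undef), empty (''), or present (anything else).
sub column_state {
    my ( $line, $field ) = @_;
    my @c = split /\t/, $line;
    return 'missing' unless defined $c[$field];
    return $c[$field] eq '' ? 'empty' : 'present';
}

print column_state( "a\t\tc", 1 ), "\n";    # empty
print column_state( "a",      1 ), "\n";    # missing
```

One thing to keep in mind: split drops trailing empty fields, so a line that ends in a separator reports its last column as missing rather than empty.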

Update: Corion has suggested FileCache as a way to circumvent the open filehandle limit.
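
A minimal sketch of how FileCache might slot in, under some assumptions: write_row is a hypothetical helper, not part of the program above, and the maxopen value of 16 is an arbitrary choice. Two details are easy to trip over. The import option is spelled maxopen, all lowercase. And cacheout hands back the path itself, to be used as a symbolic filehandle, so strict refs must be relaxed and cacheout should be called before every print rather than once per file.

```perl
use strict;
use warnings;
use FileCache maxopen => 16;    # option name is lowercase; 16 is arbitrary

# Same output conventions as the program above.
$, = "\t";    # field separator
$\ = "\n";    # record separator

# write_row() is a hypothetical helper, for illustration only.
# It writes one row to the file named $name and lets FileCache
# decide which handles are actually open at any moment.
sub write_row {
    my ( $name, @fields ) = @_;

    # cacheout returns the path, used as a symbolic filehandle.
    no strict 'refs';

    # One-argument form: opens with '>' the first time a path is
    # seen and with '>>' when FileCache has to reopen it later.
    # Ask for the handle on every call; FileCache may have closed
    # it since the last write.
    my $fh = cacheout $name;
    print {$fh} @fields;
}
```

In the loop above, the open call would then go away and the final print would become something like write_row( $file{$key}{name}, @c ). The handle is deliberately not stored in %file, since a stored handle may already have been closed by the time it is used again.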

A word spoken in Mind will reach its own level, in the objective world, by its own weight

Re^2: part - split up files according to column value
by mick2020 (Novice) on Aug 26, 2008 at 10:05 UTC
    Hi, could you give an example of how to run this? I am new to Perl. Also, how would you incorporate FileCache into this example? I want to split a file based on the first column and save each part in a file named after the value in the first column, without the quotation marks. Example data:
        "1", "This" , "is" , "test", "data"
        "1", "This" , "is" , "test", "data"
        "2", "This" , "is" , "test", "data"
        "1", "This" , "is" , "test", "data"
        "1", "This" , "is" , "test", "data"
        "4", "This" , "is" , "test", "data"
        "2", "This" , "is" , "test", "data"
        "3", "This" , "is" , "test", "data"
    This would create four files named 1, 2, 3 and 4, each containing the matching rows.
        file 1:
        "1", "This" , "is" , "test", "data"
        "1", "This" , "is" , "test", "data"
        "1", "This" , "is" , "test", "data"
        "1", "This" , "is" , "test", "data"
        file 2:
        "2", "This" , "is" , "test", "data"
        "2", "This" , "is" , "test", "data"
        file 3:
        "3", "This" , "is" , "test", "data"
        file 4:
        "4", "This" , "is" , "test", "data"
    It is a large file, so I need to use FileCache. Thanks for any help.

      You can start by telling us where you encounter problems and what difficulties you have incorporating FileCache into jdporter's code.

        I have the first part of the task completed, i.e. sorting the files. Here is my code:
            ///My code
            use FileCache maxOpen => 1000;
            ////////////
            # config:
            my $field = 0;
            my $sep = ",";
            ////MY code
            cacheout $mode, $path;
            $fh = cacheout $mode, $path;
            /////////
            $, = $sep;
            $\ = $/;

            my %file; # { num, name, $fh }
            my $fnum = 1;

            while (<>) {
                chomp;
                my @c = split /$sep/o;
                my( $key, $num ) = defined $c[$field]
                    ? ( $c[$field], $fnum++ )
                    : ( '(column not present)', 0 );
                unless ( $file{$key}) {
                    $nameF = $c[$field];
                    $nameF =~ s/"//g;
                    $file{$key}{num} = $num;
                    $file{$key}{name} = "out/".$nameF.$ARGV[0];
                    if(($file{$key}{num}) >1){
                        -f $file{$key}{name} and die "Sorry, '$file{$key}{name}' exists; won't clobber.";
                        open $file{$key}{fh}, ">", $file{$key}{name}
                            or die "Error opening '$file{$key}{name}' for write - $!";
                    }
                }
                print {$file{$key}{fh}} @c;
            }
        The problem is FileCache. I am not familiar with Perl, so I am having problems with this part of the code.
        I am getting this error:
            $ perl split.pl Input.csv .cvs
            Error opening '4444.cvs' for write - Too many open files at split.pl 39, <> line 817961.
        I have marked my additions to jdporter's code.
        I don't know what $path and $mode are.
        I have tried
        use FileCache maxOpen => 10000;
        ..
        open $file{$key}{fh}, ">", cacheout $file{$key}{name} or die
        But I get the error
        Too many open files at /usr/lib/perl5/5.10/ .... at line 408948
        I have tried changing the value of maxOpen, but this does nothing.
Re^2: part - split up files according to column value
by Anonymous Monk on Feb 08, 2011 at 11:57 UTC
    I have just found this piece of code and it is perfect for a regular task I have. However, I am struggling to get headers to work. If I use the header-line switch, it ignores the header line for file creation, but it does not seem to put the header at the top of each file it creates. Any advice?

      jdporter's program has no option for headers. Did you mean to reply to my program?

      If so, please do consider telling me how you invoke the program, and what the layout of the first few rows of your input file is and what you get for output, so that I can reproduce the problem.

        Yes, sorry, I was referring to your program, Corion. I invoke it using the following:
            part.pl filename -header-line=1 -column=3
        The first line of the file contains the header. The output files are all correct, except for the lack of a header. Here is sample input:
            field1 field2 field3 split
            zweiradlinss 89803607 330525685618 3
            zweiradlinss 89803607 310286767428 77
        I am splitting on the last column. It generates two files, one for line 2 and another for line 3. Neither has the header row.