Re: Split a file based on column

I would do something like this (untested):

use strict;
use warnings;
use autodie;

use constant IN_FN => 'sample_1.txt';

my %handles;

open my $infh, '<', IN_FN;

while( <$infh> ) {

  my( $key ) = m/^[^|]*\|([^|]+)/;  # [^|]* lets column one be any length

  if( ! defined $key ) {
    warn "Line $. appears malformed. Skipping: $_";
    next;
  }

  open $handles{$key}, '>', IN_FN . "$key.txt"
    unless exists $handles{$key};

  print {$handles{$key}} $_;

}

close $_ for $infh, values %handles;

You didn't mention the need, but it would be pretty easy to adapt this to work with a list of input files. Just replace the constant with code to deal with different input filenames, and put it in a loop. :)
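A minimal sketch of that adaptation, assuming the input filenames arrive on the command line. The `split_file` helper and the `"$in_fn.$key.txt"` output naming are illustrative choices, not part of the original script, and the first column is allowed to be any length here:

```perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: split one input file on its second column.
sub split_file {
    my ($in_fn) = @_;
    my %handles;    # fresh set of output handles for each input file

    open my $infh, '<', $in_fn;

    while ( my $line = <$infh> ) {
        my ($key) = $line =~ m/^[^|]*\|([^|]+)/;

        if ( !defined $key ) {
            warn "$in_fn line $. appears malformed. Skipping: $line";
            next;
        }

        # Open each output file once and remember its handle.
        if ( !exists $handles{$key} ) {
            open my $outfh, '>', "$in_fn.$key.txt";
            $handles{$key} = $outfh;
        }

        print { $handles{$key} } $line;
    }

    close $_ for $infh, values %handles;
    return;
}

# Assumption: every command-line argument is an input file.
split_file($_) for @ARGV;
```

Closing all the output handles at the end of each input file keeps the number of open handles bounded by the number of keys in a single file, not across the whole list.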

What I like about this solution is that you only open each output file once, and then just keep track of the file handles as values in a hash, indexed on the key parsed from the 2nd column.

Update: This solution has the efficiency advantage of not having to re-open an output file if it's already been opened before. But johngg correctly observed that at some point it's possible to get a "Too many open files" error. On one of my systems that kicked in after trying to open 1020 files simultaneously. My solution assumes that column two holds two digits, which would yield at most 100 possible output files. That should be ok.
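If you want to see where that ceiling sits on your own system, the core POSIX module can report it. This is only a sketch. A figure of 1020 would be consistent with a limit of 1024 once STDIN, STDOUT, STDERR and the input handle are counted, but that is a guess about that particular machine:

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_OPEN_MAX);

# Per-process limit on open file descriptors, as reported by the OS.
# sysconf returns undef if the value cannot be determined.
my $max_open = sysconf(_SC_OPEN_MAX);

print defined $max_open
    ? "Open file limit: $max_open\n"
    : "Open file limit: unknown\n";
```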

However, if it turns out that you're exceeding the number of allowable open files on your system, you can open/close on each iteration (the simplest solution).
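A rough sketch of the open/close-per-line variant follows. The `split_reopening` helper name and the `"$in_fn.$key.txt"` output naming are made up for illustration. The important detail is the append mode, since `'>'` would truncate the output file on every line:

```perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: split one input file, reopening the output
# file for every line so only one output handle is open at a time.
sub split_reopening {
    my ($in_fn) = @_;

    open my $infh, '<', $in_fn;

    while ( my $line = <$infh> ) {
        my ($key) = $line =~ m/^[^|]*\|([^|]+)/;

        if ( !defined $key ) {
            warn "Line $. appears malformed. Skipping: $line";
            next;
        }

        # '>>' appends; the file is created if it does not exist yet.
        open my $outfh, '>>', "$in_fn.$key.txt";
        print {$outfh} $line;
        close $outfh;
    }

    close $infh;
    return;
}

# Assumption: every command-line argument is an input file.
split_reopening($_) for @ARGV;
```

Because everything is appended, output files left over from an earlier run need to be removed first, or the new lines will land after the old ones.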

Dave

Re^2: Split a file based on column by roboticus (Chancellor) on Jan 17, 2013 at 04:19 UTC
davido, brad_nov:

I saw davido's solution, and played around with it to add a limit to the number of open files in `%handles` using a least-recently used (LRU) cache. No real reason, but I thought I'd amuse myself while my son got ready for bed. You could trim it down a bit, as much of the code just implements traces to show what's happening as it runs.

$ cat t_file_queue.pl
#!/usr/bin/perl
# Updated PM 1013651 to have a limit on file handles

use strict;
use warnings;
use autodie;
use 5.10.0;

my %handles;
my $MAX_OPEN_FH=3;

while( <DATA> ) {
    my( $key ) = m/^[^|]\|([^|]+)/;
    if( ! defined $key ) {
        warn "Line $. appears malformed. Skipping: $_";
        next;
    }
    print {FH("$key.txt")} $_;
}

close $$_{FH} for values %handles;

sub FH {
    # Return file handle for named file
    state $cnt=0;
    my $key= shift;

    # Return current handle if it exists
    if (exists $handles{$key}) {
        $handles{$key}{cnt}=++$cnt;
        print "$key: (cnt=$cnt) found\n";
        return $handles{$key}{FH};
    }

    # Doesn't exist, retire the "oldest" one if we're at the limit
    if (keys %handles >= $MAX_OPEN_FH) {
        my @tmp = sort { $$a{cnt} <=> $$b{cnt} } values %handles;
        say "$key: Too many open files, close one: ",
            join(", ",map { "$$_{FName}:$$_{cnt}" } @tmp);
        my $hr = $tmp[0];
        print " closing $$hr{FName}\n";
        close $$hr{FH};
        delete $handles{$$hr{FName}};
    }

    open my $FH, '>>', $key;
    $handles{$key} = { cnt=>++$cnt, FName=>$key, FH=>$FH };
    print "$key: opened new file ($cnt)\n";
    return $FH;
}

__DATA__
a|1|foo
b|1|bar
c|2|baz
d|1|xyzzy
e|2|blarg
f|2|The
g|3|quick
h|2|red
i|2|fox
j|3|jumped
k|4|over
l|1|the
m|1|lazy
n|1|brown
o|1|dog
p|5|gorgonzola

Running it gives me:

$ ./t_file_queue.pl
1.txt: opened new file (1)
1.txt: (cnt=2) found
2.txt: opened new file (3)
1.txt: (cnt=4) found
2.txt: (cnt=5) found
2.txt: (cnt=6) found
3.txt: opened new file (7)
2.txt: (cnt=8) found
2.txt: (cnt=9) found
3.txt: (cnt=10) found
4.txt: Too many open files, close one: 1.txt:4, 2.txt:9, 3.txt:10
 closing 1.txt
4.txt: opened new file (11)
1.txt: Too many open files, close one: 2.txt:9, 3.txt:10, 4.txt:11
 closing 2.txt
1.txt: opened new file (12)
1.txt: (cnt=13) found
1.txt: (cnt=14) found
1.txt: (cnt=15) found
5.txt: Too many open files, close one: 3.txt:10, 4.txt:11, 1.txt:15
 closing 3.txt
5.txt: opened new file (16)

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re^3: Split a file based on column by davido (Cardinal) on Jan 17, 2013 at 06:11 UTC
Good job roboticus. I was thinking instead of some solution that would keep track of frequency of use for opened filehandles. Whenever 'open' fails due to too many files open, drop the least used handle. But I wasn't sure how to implement the frequency structure. A heap (priority queue) sounds good, except that it's probably relatively expensive to update the priority of a file handle each time it's used. Most heap implementations would just delete and re-insert the element being modified.

Seems like there must be a solution that isn't prohibitively expensive, but I'm drawing a blank. There must be something on CPAN, but regardless, it would be nice to know how best to implement a...um... "priority cache"? ;)

Dave
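One inexpensive way to approximate the frequency idea, sketched here under some assumptions: keep a plain use counter per handle, which costs O(1) per print, and scan for the smallest counter only when something has to be evicted. The `lfu_fh` helper, the `%cache` layout, and the fixed `$MAX_OPEN` limit are all hypothetical. This version evicts at a fixed limit instead of waiting for `open` to fail, and a file that is evicted and reopened starts counting from 1 again:

```perl
use strict;
use warnings;
use autodie;

my $MAX_OPEN = 3;    # illustrative limit, same as roboticus's script
my %cache;           # filename => { fh => handle, uses => count }

# Hypothetical helper: return an append-mode handle for $fname,
# evicting the least-frequently-used handle when the cache is full.
sub lfu_fh {
    my ($fname) = @_;

    # Cache hit: bump the counter, no reordering needed.
    if ( my $entry = $cache{$fname} ) {
        $entry->{uses}++;
        return $entry->{fh};
    }

    # Cache full: linear scan for the smallest use count.
    # Ties are broken arbitrarily (hash order).
    if ( keys %cache >= $MAX_OPEN ) {
        my $victim;
        for my $name ( keys %cache ) {
            $victim = $name
                if !defined $victim
                || $cache{$name}{uses} < $cache{$victim}{uses};
        }
        close $cache{$victim}{fh};
        delete $cache{$victim};
    }

    # '>>' so a file that was evicted earlier is not truncated.
    open my $fh, '>>', $fname;
    $cache{$fname} = { fh => $fh, uses => 1 };
    return $fh;
}
```

Usage would look like `print { lfu_fh("$key.txt") } $_;` inside the read loop. The scan is linear in the number of open handles, but it only runs on a cache miss, which is also when the much more expensive `open` happens.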
Re^4: Split a file based on column by roboticus (Chancellor) on Jan 17, 2013 at 11:35 UTC
davido:

I've used a priority queue in a C program a dozen or so years ago, and it worked well. As far as the overhead goes, I wouldn't expect it to be prohibitive, especially when compared to the time savings of opening a file.

Part of the reason I chose an LRU cache for this one is that I've found they work pretty well for the types of applications I use--at least when the number of file handles is more reasonable. Most of the data I play with tends to be 'clumped' in that similar records tend to be closer together. For example, when I process some credit card data, I'll have long runs of Visa transactions, somewhat shorter runs of MasterCard transactions, while others (American Express, Discover) are frequently very short runs.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re^4: Split a file based on column by Anonymous Monk on Jan 17, 2013 at 10:56 UTC
FileCache - keep more files open than the system permits
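A short sketch of how the core FileCache module might be applied to this problem. The `split_with_filecache` helper, the `maxopen => 3` limit, and the output naming are illustrative assumptions. As I understand the module, `cacheout` hands back a symbolic filehandle named after the path, which is why strict refs is relaxed around the `print`:

```perl
use strict;
use warnings;
use FileCache maxopen => 3;    # illustrative limit on cached handles

# Hypothetical helper: split one input file on its second column,
# letting FileCache decide which output handles stay open.
sub split_with_filecache {
    my ($in_fn) = @_;

    open my $infh, '<', $in_fn or die "Cannot open $in_fn: $!";

    while ( my $line = <$infh> ) {
        my ($key) = $line =~ m/^[^|]*\|([^|]+)/;
        next unless defined $key;

        # With no explicit mode, cacheout uses '>' the first time it
        # sees a path and '>>' when it has to reopen it later, so
        # evicted files are not truncated.
        my $fh = cacheout "$in_fn.$key.txt";

        # The returned handle may be a symbolic reference.
        no strict 'refs';
        print {$fh} $line;
    }

    close $infh;
    return;
}

# Assumption: every command-line argument is an input file.
# Handles still cached at the end are flushed when the program exits.
split_with_filecache($_) for @ARGV;
```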
Re^5: Split a file based on column by davido (Cardinal) on Jan 17, 2013 at 18:04 UTC
Re^6: Split a file based on column by Anonymous Monk on Jan 18, 2013 at 00:59 UTC