Re^5: Spliting Table

I wasn't sure if this would scale to 3+GB but I tested it with a 500MB file (100 vps and 100_000 lines) and it took <1 minute.

#!perl
use strict;
use warnings;

my %head = (); 
my @vp   = ();
my %fh   = ();
my $width;

my $t0 = time();
my $infile = '500M.dat';

# read header
open IN,'<',$infile or die "could not open $infile : $!";
chomp( my $line1 = <IN> );
my @head = split "\t", $line1;

# scan across the columns
my $k = 3; # repeat fields
for my $c ($k+1..$#head){
  my ($vp,$attr) = split '\.',$head[$c];
  
  # open new filehandle for each vp
  if (not exists $fh{$vp}){
    my $outfile = "out_$vp.dat";
    open $fh{$vp},'>',$outfile or die "Could not open $outfile : $!";
    push @vp,$vp;
    @{$head{$vp}} = @head[0..$k+1];
    print "Opened $outfile for $vp\n";
  } else { 
    push @{$head{$vp}},$head[$c];
  }
  
  ++$width if (@vp < 2)
}
print "Width = $width\n";

# write headers to outfiles
for (keys %fh){
  print { $fh{$_} } (join "\t",@{$head{$_}})."\n";
}

# process file
my $count = 1;
while (<IN>){
 chomp;
 my @f = split "\t",$_;
 my $begin = 4;
 for my $vp (@vp){
   my $end = $begin + $width - 1;
   #print "$vp $begin $end\n";
   print { $fh{$vp} } (join "\t",@f[0..3,$begin..$end])."\n";
   # move along to next vp
   $begin = $begin + $width;
 }
 ++$count;
}

# close out files
for (keys %fh){
  close $fh{$_};
  print "File closed for $_\n";
}

my $dur = time - $t0;
print "$count lines read from $infile\n";
print scalar @vp." files created in $dur seconds\n";
[download]

update : header line corrected to include AVG_Beta
poj

Comment on Re^5: Spliting Table Download Code

Replies are listed 'Best First'.
Re^6: Spliting Table by CountZero (Bishop) on Aug 21, 2015 at 08:07 UTC
One question, why do you immediately open a filehandle for each VP and keep it open? Is there no risk that you will run out of filehandles if there are (perhaps) thousands of different VPs? CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply]
Re^7: Spliting Table by poj (Abbot) on Aug 21, 2015 at 08:42 UTC
Yes there is that risk but here Re^2: Spliting Table (a solution) the OP said 'as there are over 100people' which I took to mean less than 200 poj	[reply]