in reply to Re^2: Spliting Table
in thread Spliting Table

Please put your data inside <code>...</code> tags as well. That makes it easier and less error-prone to download.

Are the data fields separated by a tab character and the records separated by an EOL code?

It seems like you have a data-file with records of 22 fields each according to the header, but with 40 fields according to the data-records. How is that possible?
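One quick way to check for a mismatch like this is to tally the number of fields per line. The following is only a minimal sketch: the helper name `field_count_tally` and the in-memory sample data are made up for illustration, and it assumes the delimiter really is a tab.

```perl
#!perl
use strict;
use warnings;

# Tally how many fields each line has when split on a given delimiter.
# Returns a hash ref: field count => number of lines with that count.
sub field_count_tally {
    my ($fh, $delim) = @_;
    my %tally;
    while (my $line = <$fh>) {
        chomp $line;
        # A limit of -1 keeps trailing empty fields so they are counted too
        my @fields = split $delim, $line, -1;
        $tally{ scalar @fields }++;
    }
    return \%tally;
}

# Demo on an in-memory sample (made-up data, tab-delimited)
my $sample = "A\tB\tC\n1\t2\t3\n4\t5\t6\t7\n";
open my $fh, '<', \$sample or die "Could not open in-memory file: $!";
my $tally = field_count_tally($fh, qr/\t/);
close $fh;

for my $n (sort { $a <=> $b } keys %$tally) {
    print "$n fields: $tally->{$n} line(s)\n";
}
```

If the header and the records are consistent, the tally has a single entry. More than one entry means some lines were split differently from the header.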

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Re^4: Spliting Table
by tschelineli (Initiate) on Aug 20, 2015 at 10:38 UTC

    I just pasted a part of my "motherfile" in here. And you're absolutely right, there was a mistake in it, but here's the new one, in code format. =) Yes, the file is tab-delimited, with a newline at the end of each record.

    Index TargetID ProbeID_A ProbeID_B VP1.AVG_Beta VP1.Intensity VP1.Avg_NBEADS_A VP1.Avg_NBEADS_B VP1.BEAD_STDERR_A VP1.BEAD_STDERR_B VP1.Signal_A VP1.Signal_B VP1.Detection Pval VP2.AVG_Beta VP2.Intensity VP2.Avg_NBEADS_A VP2.Avg_NBEADS_B VP2.BEAD_STDERR_A VP2.BEAD_STDERR_B VP2.Signal_A VP2.Signal_B VP2.Detection Pval
    1 cg00000029 14782418 14782418 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
    2 cg00000108 12709357 12709357 0,9218118 3609 12 12 44,74464 155,0186 190 3419 0,00 0,9602971 7305 11 11 35,27683 130,8559 194 7111 0,00
    3 cg00000109 59755374 59755374 0,650519 767 4 4 51,5 151,5 203 564 0,00 0,8245264 1906 10 10 24,03331 136,6104 252 1654 0,00
    4 cg00000165 12637463 12637463 0,3073516 1029 20 20 59,47941 28,39806 682 347 0,00 0,2899073 1842 17 17 80,52183 51,41755 1279 563 0,00
    5 cg00000236 12649348 12649348 0,8236473 1397 14 14 18,17377 105,0337 164 1233 0,00 0,8691943 3065 13 13 42,71191 160,031 314 2751 0,00
    6 cg00000289 18766346 18766346 0,4625375 901 14 14 48,37429 46,23619 438 463 0,00 0,590708 1256 11 11 78,69446 89,85038 455 801 0,00
      Is it me or can anyone else make out which data goes to which header? Maybe upload the file and link it. EDIT: The above data isn't tab-delimited, it is space-delimited. Use /\s+/ when splitting.
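To illustrate the difference between the split patterns mentioned here, a small sketch follows. The sample strings are made up, apart from the column names taken from the posted header. Note one caveat: if a column name really contains a space, as "VP1.Detection Pval" appears to, a whitespace split will cut that name in two.

```perl
use strict;
use warnings;

# Made-up sample: spaces between most fields, one tab near the end
my $line = "1 cg00000029  14782418\t0,7469755";

my @by_tab   = split /\t/,  $line;    # splits on tabs only
my @by_space = split /\s+/, $line;    # splits on any run of whitespace
print scalar(@by_tab),   "\n";        # 2
print scalar(@by_space), "\n";        # 4

# With leading whitespace, /\s+/ yields an empty first field; ' ' does not
my @lead_re  = split /\s+/, "  a b";
my @lead_awk = split ' ',   "  a b";
print scalar(@lead_re),  "\n";        # 3 ('', 'a', 'b')
print scalar(@lead_awk), "\n";        # 2 ('a', 'b')

# A column name containing a space is cut in two by a whitespace split
my @names = split /\s+/, "VP1.Signal_B VP1.Detection Pval";
print scalar(@names), "\n";           # 3, although there are only 2 columns
```

So the choice of pattern should follow what is actually in the file on disk, not how the data looks after being pasted into a post.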

      I wasn't sure if this would scale to 3+GB but I tested it with a 500MB file (100 vps and 100_000 lines) and it took <1 minute.

      #!perl
      use strict;
      use warnings;

      my %head = ();
      my @vp = ();
      my %fh = ();
      my $width;

      my $t0 = time();
      my $infile = '500M.dat';

      # read header
      open IN,'<',$infile or die "could not open $infile : $!";
      chomp( my $line1 = <IN> );
      my @head = split "\t", $line1;

      # scan across the columns
      my $k = 3; # repeat fields
      for my $c ($k+1..$#head){
        my ($vp,$attr) = split '\.',$head[$c];
        # open new filehandle for each vp
        if (not exists $fh{$vp}){
          my $outfile = "out_$vp.dat";
          open $fh{$vp},'>',$outfile or die "Could not open $outfile : $!";
          push @vp,$vp;
          # shared columns plus this vp's own first column
          @{$head{$vp}} = (@head[0..$k],$head[$c]);
          print "Opened $outfile for $vp\n";
        } else {
          push @{$head{$vp}},$head[$c];
        }
        # count columns of the first vp only
        ++$width if (@vp < 2)
      }
      print "Width = $width\n";

      # write headers to outfiles
      for (keys %fh){
        print { $fh{$_} } (join "\t",@{$head{$_}})."\n";
      }

      # process file
      my $count = 1;
      while (<IN>){
        chomp;
        my @f = split "\t",$_;
        my $begin = 4;
        for my $vp (@vp){
          my $end = $begin + $width - 1;
          #print "$vp $begin $end\n";
          print { $fh{$vp} } (join "\t",@f[0..3,$begin..$end])."\n";
          # move along to next vp
          $begin = $begin + $width;
        }
        ++$count;
      }

      # close out files
      for (keys %fh){
        close $fh{$_};
        print "File closed for $_\n";
      }

      my $dur = time - $t0;
      print "$count lines read from $infile\n";
      print scalar @vp." files created in $dur seconds\n";
      update : header line corrected to include AVG_Beta
      poj
        One question, why do you immediately open a filehandle for each VP and keep it open? Is there no risk that you will run out of filehandles if there are (perhaps) thousands of different VPs?
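One possible way around a limit on open filehandles, sketched here only as an illustration and not taken from the script above, is to handle the VPs in batches and make one pass over the input per batch. The names `batches`, `split_in_batches` and `$max_open` are hypothetical, and the sketch assumes a tab-delimited file with `$k` shared leading columns followed by `$width` columns per VP.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of names into batches of at most $max.
sub batches {
    my ($max, @names) = @_;
    die "batch size must be at least 1" if $max < 1;
    my @batches;
    push @batches, [ splice @names, 0, $max ] while @names;
    return @batches;
}

# Write one file per VP while keeping at most $max_open output handles
# open at a time. The input is re-read once for every batch.
sub split_in_batches {
    my ($infile, $k, $width, $max_open, @vp) = @_;

    # first column index of each VP, assuming equal-width blocks
    my %start;
    @start{@vp} = map { $k + $_ * $width } 0 .. $#vp;

    for my $batch ( batches($max_open, @vp) ) {
        my %fh;
        for my $vp (@$batch) {
            open $fh{$vp}, '>', "out_$vp.dat"
                or die "Could not open out_$vp.dat : $!";
        }
        open my $in, '<', $infile or die "Could not open $infile : $!";
        while (my $line = <$in>) {   # the header line is sliced like any other
            chomp $line;
            my @f = split /\t/, $line, -1;
            for my $vp (@$batch) {
                my $begin = $start{$vp};
                print { $fh{$vp} }
                    join("\t", @f[0 .. $k-1, $begin .. $begin+$width-1]), "\n";
            }
        }
        close $in;
        for my $vp (@$batch) {
            close $fh{$vp} or die "Could not close out_$vp.dat : $!";
        }
    }
}
```

A call might look like `split_in_batches('500M.dat', 4, 9, 200, @vp)`. The trade-off is that the input is read once per batch, so this only pays off when the number of VPs actually exceeds what the system allows a process to keep open.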

        CountZero
