in reply to Spliting Table

You should at least supply the data separators (field separator and record separator, if any), or perhaps better provide a small sample of your file.

And also explain how you intend to extract columns 14-22 from a file with 9 columns per person. But it is probably me not understanding your input format. Please explain.

Also please note that with a 3.4 GB input file, you might end up with millions of output files. I am not sure that all file systems can cope with that and, even if yours does, it might not be very practical.

Re^2: Spliting Table
by tschelineli (Initiate) on Aug 20, 2015 at 10:01 UTC
    So I don't know whether it really helps, but here's part of my "motherfile". There are more columns for more VPs and more lines for more TargetIDs. The motherfile contains all the data, and for each VP I want to extract one file with the first four columns plus the columns belonging to that VP (the VP*.*** columns). I'm sorry that I can't explain it any better.
    open(RAUS,$outfile);
    while (<REIN>) {
        chomp();
        @we = split(/\t/);
        $fix = "$we[0]\t$we[1]\t$we[2]\t$we[3]";
        $out1 = "$we[4]\t$we[5]\t$we[6]\t$we[7]\t$we[8]\t$we[9]\t$we[10]\t$we[11]\t$we[12]";
        print RAUS $fix ."\t" .$out1."\n";
    }
    close(RAUS);
    close(REIN);
     This is part of my very nasty code for printing the right stuff for one VP, but it is highly useless for the entire file.. =)
    Here's my data:
    
    Index	TargetID	ProbeID_A	ProbeID_B	VP1.AVG_Beta	VP1.Intensity	VP1.Avg_NBEADS_A	VP1.Avg_NBEADS_B	VP1.BEAD_STDERR_A	VP1.BEAD_STDERR_B	VP1.Signal_A	VP1.Signal_B	VP1.Detection Pval	VP2.AVG_Beta	VP2.Intensity	VP2.Avg_NBEADS_A	VP2.Avg_NBEADS_B	VP2.BEAD_STDERR_A	VP2.BEAD_STDERR_B	VP2.Signal_A	VP2.Signal_B	VP2.Detection Pval
    1	cg00000029	14782418	14782418	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
    2	cg00000108	12709357	12709357	0,9218118	3609	12	12	44,74464	155,0186	190	3419	0,00	0,9602971	7305	11	11	35,27683	130,8559	194	7111	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
    3	cg00000109	59755374	59755374	0,650519	767	4	4	51,5	151,5	203	564	0,00	0,8245264	1906	10	10	24,03331	136,6104	252	1654	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
    4	cg00000165	12637463	12637463	0,3073516	1029	20	20	59,47941	28,39806	682	347	0,00	0,2899073	1842	17	17	80,52183	51,41755	1279	563	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
    5	cg00000236	12649348	12649348	0,8236473	1397	14	14	18,17377	105,0337	164	1233	0,00	0,8691943	3065	13	13	42,71191	160,031	314	2751	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
    6	cg00000289	18766346	18766346	0,4625375	901	14	14	48,37429	46,23619	438	463	0,00	0,590708	1256	11	11	78,69446	89,85038	455	801	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
      Please, put your data inside <code>...</code> tags as well. It makes it easier and less error prone to download it.

      Are the data fields separated by a "tab" character and the records separated by an EOL code?

      It seems like you have a data-file with records of 22 fields each according to the header, but with 40 fields according to the data-records. How is that possible?
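A mismatch like this is easy to confirm before writing any splitting code. The following is a minimal sketch, assuming the file really is tab-delimited with one record per line; the name `check_field_counts` is made up for illustration and is not part of the OP's code.

```perl
use strict;
use warnings;

# Compare each record's tab-separated field count with the header's.
# Returns the header's field count followed by [line_number, field_count]
# pairs for every record that does not match.
sub check_field_counts {
    my ($fh) = @_;

    my $header = <$fh>;
    return (0) unless defined $header;
    chomp $header;

    # The -1 limit keeps trailing empty fields, which split would otherwise drop.
    my @names    = split /\t/, $header, -1;
    my $expected = scalar @names;

    my @mismatches;
    my $line_no = 1;    # the header was line 1
    while ( my $line = <$fh> ) {
        $line_no++;
        chomp $line;
        my @fields = split /\t/, $line, -1;
        push @mismatches, [ $line_no, scalar @fields ]
            if @fields != $expected;
    }
    return ( $expected, @mismatches );
}
```

Called with a read handle on the input file, it returns the expected count first, so `my ($expected, @bad) = check_field_counts($fh);` leaves the offending line numbers and their field counts in `@bad`.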

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics

        I just pasted a part of my "motherfile" in here, and you're absolutely right, there was a mistake in it. Here's the new one, in code format. =) Yes, the file is tab-delimited, with a newline at the end of each record.

        Index	TargetID	ProbeID_A	ProbeID_B	VP1.AVG_Beta	VP1.Intensity	VP1.Avg_NBEADS_A	VP1.Avg_NBEADS_B	VP1.BEAD_STDERR_A	VP1.BEAD_STDERR_B	VP1.Signal_A	VP1.Signal_B	VP1.Detection Pval	VP2.AVG_Beta	VP2.Intensity	VP2.Avg_NBEADS_A	VP2.Avg_NBEADS_B	VP2.BEAD_STDERR_A	VP2.BEAD_STDERR_B	VP2.Signal_A	VP2.Signal_B	VP2.Detection Pval
        1	cg00000029	14782418	14782418	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
        2	cg00000108	12709357	12709357	0,9218118	3609	12	12	44,74464	155,0186	190	3419	0,00	0,9602971	7305	11	11	35,27683	130,8559	194	7111	0,00
        3	cg00000109	59755374	59755374	0,650519	767	4	4	51,5	151,5	203	564	0,00	0,8245264	1906	10	10	24,03331	136,6104	252	1654	0,00
        4	cg00000165	12637463	12637463	0,3073516	1029	20	20	59,47941	28,39806	682	347	0,00	0,2899073	1842	17	17	80,52183	51,41755	1279	563	0,00
        5	cg00000236	12649348	12649348	0,8236473	1397	14	14	18,17377	105,0337	164	1233	0,00	0,8691943	3065	13	13	42,71191	160,031	314	2751	0,00
        6	cg00000289	18766346	18766346	0,4625375	901	14	14	48,37429	46,23619	438	463	0,00	0,590708	1256	11	11	78,69446	89,85038	455	801	0,00
      This is part of my very nasty code for printing the right stuff for one VP, but it is highly useless for the entire file
      No, you almost have it (at least if I understood correctly what you want to do). Your code is reading the file line by line, which is what you need. For each line, you just need to pick up the relevant fields and print them out.

      You only need to open a new output file for each input record.

      You did not say how you want to name your files, so I will use the first field, as it seems to be a line number.

      while (<REIN>) {
          chomp();
          my @we = split(/\t/);
          my $out_file_name = "out_file_nr_$we[0]";
          open my $RAUS, ">", $out_file_name or die "could not open $out_file_name $!";
          my $fix = "$we[0]\t$we[1]\t$we[2]\t$we[3]";
          my $out1 = "$we[4]\t$we[5]\t$we[6]\t$we[7]\t$we[8]\t$we[9]\t$we[10]\t$we[11]\t$we[12]";
          print $RAUS $fix ."\t" .$out1."\n";
          close $RAUS;
      }
      close(REIN);
      I do not understand, though, why you are splitting the data into fields and then joining them back into the original format before printing it. So this could be simply:
      while (<REIN>) {
          my $out_file_name = "out_file_nr_[$.]";   # $. is the line number on the last read file handle
          open my $RAUS, ">", $out_file_name or die "could not open $out_file_name $!";
          print $RAUS $_;
          close $RAUS;
      }
      close(REIN);
      Or did I miss something in your requirement?

        As I understand it, the input file has hundreds of columns like this (transposed):

        Columns             Row 1       Row 2       ... etc for 1000s rows
        Index               1           2
        TargetID            cg00000029  cg00000108
        ProbeID_A           14782418    12709357
        ProbeID_B           14782418    12709357
        VP1.AVG_Beta        0,7469755   0,9218118
        VP1.Intensity       2793        3609
        VP1.Avg_NBEADS_A    15          12
        VP1.Avg_NBEADS_B    15          12
        VP1.BEAD_STDERR_A   33,82405    44,74464
        VP1.BEAD_STDERR_B   72,03749    155,0186
        VP1.Signal_A        632         190
        VP1.Signal_B        2161        3419
        VP1.Detection Pval  0,00        0,00
        VP2.AVG_Beta        0,6678689   0,9602971
        VP2.Intensity       2950        7305
        VP2.Avg_NBEADS_A    18          11
        VP2.Avg_NBEADS_B    18          11
        VP2.BEAD_STDERR_A   96,40222    35,27683
        VP2.BEAD_STDERR_B   126,8078    130,8559
        VP2.Signal_A        913         194
        VP2.Signal_B        2037        7111
        VP2.Detection Pval  0,00        0,00
        ..
        .. repeated for hundreds of VPs

        and the OP wants to create hundreds of individual files like this (but transposed):

        File VP1.txt
        ------------
        Index               1           2           ... etc for 1000s rows
        TargetID            cg00000029  cg00000108
        ProbeID_A           14782418    12709357
        ProbeID_B           14782418    12709357
        VP1.AVG_Beta        0,7469755   0,9218118
        VP1.Intensity       2793        3609
        VP1.Avg_NBEADS_A    15          12
        VP1.Avg_NBEADS_B    15          12
        VP1.BEAD_STDERR_A   33,82405    44,74464
        VP1.BEAD_STDERR_B   72,03749    155,0186
        VP1.Signal_A        632         190
        VP1.Signal_B        2161        3419
        VP1.Detection Pval  0,00        0,00

        File VP2.txt
        ------------
        Index               1           2           ... etc for 1000s rows
        TargetID            cg00000029  cg00000108
        ProbeID_A           14782418    12709357
        ProbeID_B           14782418    12709357
        VP2.AVG_Beta        0,6678689   0,9602971
        VP2.Intensity       2950        7305
        VP2.Avg_NBEADS_A    18          11
        VP2.Avg_NBEADS_B    18          11
        VP2.BEAD_STDERR_A   96,40222    35,27683
        VP2.BEAD_STDERR_B   126,8078    130,8559
        VP2.Signal_A        913         194
        VP2.Signal_B        2037        7111
        VP2.Detection Pval  0,00        0,00
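If that reading is right, the split could be sketched roughly as follows. This is only a sketch under several assumptions: the file is tab-delimited, the first four columns are shared, every other column is named `<VP>.<something>`, and the operating system allows one open handle per VP. The name `split_by_vp` and the `$open_out` callback are invented for illustration; the output naming (`VP1.txt`, `VP2.txt`, ...) follows the layout above.

```perl
use strict;
use warnings;

# Split a wide tab-separated table into one table per VP.
# $in       : read handle on the input file
# $open_out : callback returning a write handle for a VP name (hypothetical
#             helper, so the caller decides how the output files are named)
# $n_fixed  : number of leading columns copied into every file (default 4)
# Returns the VP names in the order they appear in the header.
sub split_by_vp {
    my ( $in, $open_out, $n_fixed ) = @_;
    $n_fixed //= 4;    # Index, TargetID, ProbeID_A, ProbeID_B

    my $header = <$in>;
    return unless defined $header;
    chomp $header;
    my @names = split /\t/, $header, -1;

    # Group column indices by the prefix before the first dot, e.g. "VP1".
    my ( %cols, @order );
    for my $i ( $n_fixed .. $#names ) {
        my ($vp) = $names[$i] =~ /^([^.]+)\./ or next;
        push @order, $vp unless $cols{$vp};
        push @{ $cols{$vp} }, $i;
    }

    # Open each output once and write its header line.
    my %out;
    for my $vp (@order) {
        $out{$vp} = $open_out->($vp);
        print { $out{$vp} }
            join( "\t", @names[ 0 .. $n_fixed - 1, @{ $cols{$vp} } ] ), "\n";
    }

    # Read the input once, line by line, and fan each record out to every VP.
    while ( my $line = <$in> ) {
        chomp $line;
        my @f = split /\t/, $line, -1;
        for my $vp (@order) {
            # Short records yield empty strings instead of undef warnings.
            print { $out{$vp} } join( "\t",
                map { $_ // '' } @f[ 0 .. $n_fixed - 1, @{ $cols{$vp} } ] ),
                "\n";
        }
    }
    close $_ for values %out;
    return @order;
}

# Runs only when an input file name is given on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "could not open $ARGV[0]: $!";
    split_by_vp( $in, sub {
        my ($vp) = @_;
        open my $fh, '>', "$vp.txt" or die "could not open $vp.txt: $!";
        return $fh;
    } );
    close $in;
}
```

Keeping all output handles open means the 3.4 GB input is read only once. With hundreds of VPs that should normally stay within the per-process limit on open files, but that limit is system dependent and worth checking.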
        poj