So I don't know whether it really helps, but here's part of my "motherfile". There are more columns for additional VPs and more rows for additional TargetIDs.
The motherfile contains all the data, and I want to extract one file per VP, containing the first four columns plus the corresponding VP*.* columns.
Huh, I'm sorry that I'm not really capable of explaining it well.
open(REIN, '<', $infile)  or die "can't open $infile: $!";
open(RAUS, '>', $outfile) or die "can't open $outfile: $!";
while (<REIN>) {
    chomp();
    @we   = split(/\t/);
    $fix  = "$we[0]\t$we[1]\t$we[2]\t$we[3]";
    $out1 = "$we[4]\t$we[5]\t$we[6]\t$we[7]\t$we[8]\t$we[9]\t$we[10]\t$we[11]\t$we[12]";
    print RAUS $fix . "\t" . $out1 . "\n";
}
close(RAUS);
close(REIN);
This is part of my very nasty code for printing the right stuff for one VP, but it is highly useless for the entire file. =)
Here's my data:
Index TargetID ProbeID_A ProbeID_B VP1.AVG_Beta VP1.Intensity VP1.Avg_NBEADS_A VP1.Avg_NBEADS_B VP1.BEAD_STDERR_A VP1.BEAD_STDERR_B VP1.Signal_A VP1.Signal_B VP1.Detection Pval VP2.AVG_Beta VP2.Intensity VP2.Avg_NBEADS_A VP2.Avg_NBEADS_B VP2.BEAD_STDERR_A VP2.BEAD_STDERR_B VP2.Signal_A VP2.Signal_B VP2.Detection Pval
1 cg00000029 14782418 14782418 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
2 cg00000108 12709357 12709357 0,9218118 3609 12 12 44,74464 155,0186 190 3419 0,00 0,9602971 7305 11 11 35,27683 130,8559 194 7111 0,00 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
3 cg00000109 59755374 59755374 0,650519 767 4 4 51,5 151,5 203 564 0,00 0,8245264 1906 10 10 24,03331 136,6104 252 1654 0,00 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
4 cg00000165 12637463 12637463 0,3073516 1029 20 20 59,47941 28,39806 682 347 0,00 0,2899073 1842 17 17 80,52183 51,41755 1279 563 0,00 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
5 cg00000236 12649348 12649348 0,8236473 1397 14 14 18,17377 105,0337 164 1233 0,00 0,8691943 3065 13 13 42,71191 160,031 314 2751 0,00 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
6 cg00000289 18766346 18766346 0,4625375 901 14 14 48,37429 46,23619 438 463 0,00 0,590708 1256 11 11 78,69446 89,85038 455 801 0,00 0,7469755 2793 15 15 33,82405 72,03749 632 2161 0,00 0,6678689 2950 18 18 96,40222 126,8078 913 2037 0,00
This is part of my very nasty code for printing the right stuff for one VP, but it is highly useless for the entire file
No, you almost have it (at least if I understood correctly what you want to do). Your code is reading the file line by line, which is what you need. For each line, you just need to pick up the relevant fields and print them out.
You only need to open a new output file for each input record.
You did not say how you want to name your files, so I will use the first field, as it seems to be a line number.
while (<REIN>) {
    chomp();
    my @we = split(/\t/);
    my $out_file_name = "out_file_nr_$we[0]";
    open my $RAUS, ">", $out_file_name
        or die "could not open $out_file_name: $!";
    my $fix  = "$we[0]\t$we[1]\t$we[2]\t$we[3]";
    my $out1 = "$we[4]\t$we[5]\t$we[6]\t$we[7]\t$we[8]\t$we[9]\t$we[10]\t$we[11]\t$we[12]";
    print $RAUS $fix . "\t" . $out1 . "\n";
    close $RAUS;
}
close(REIN);
I do not understand, though, why you are splitting the data into fields and then joining them back into the original format before printing. So this could simply be:
while (<REIN>) {
    my $out_file_name = "out_file_nr_$.";    # $. is the line number on
                                             # the last read file handle
    open my $RAUS, ">", $out_file_name
        or die "could not open $out_file_name: $!";
    print $RAUS $_;
    close $RAUS;
}
close(REIN);
Or did I miss something in your requirement?
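If the actual requirement is one output file per VP (rather than per record), here is a minimal sketch. It assumes 4 fixed columns followed by a fixed number of columns per VP (9, as in the sample data); the output file names "VP<n>.txt" and the sub name are my invention, not from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the motherfile into one file per VP: the 4 fixed columns
# (Index, TargetID, ProbeID_A, ProbeID_B) plus that VP's block of
# $per_vp columns. Output handles are opened lazily, one per VP.
sub split_motherfile {
    my( $infile, $per_vp ) = @_;
    open my $in, '<', $infile or die "can't open $infile: $!";
    my @out;    # one handle per VP, opened on first use
    while ( <$in> ) {
        chomp;
        my @we  = split /\t/;
        my $nvp = ( @we - 4 ) / $per_vp;    # number of VP blocks on this line
        for my $vp ( 1 .. $nvp ) {
            unless ( $out[$vp] ) {
                open $out[$vp], '>', "VP$vp.txt"
                    or die "can't open VP$vp.txt: $!";
            }
            my $start = 4 + ( $vp - 1 ) * $per_vp;
            print { $out[$vp] } join( "\t",
                @we[ 0 .. 3, $start .. $start + $per_vp - 1 ] ), "\n";
        }
    }
    close $_ for $in, grep defined, @out;
}

split_motherfile( $ARGV[0], 9 ) if @ARGV;    # e.g. perl split.pl motherfile.txt
```

With more file handles than the OS allows you would have to open/close per line instead, but for a handful of VPs keeping them open is fine.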
#!/usr/bin/perl --
## tinking.pl
## 2015-08-20-02:28:04
##
##
##
##
## perltidy -olq -csc -csci=10 -cscl="sub : BEGIN END if while for " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
## perltidy -olq -csc -csci=3 -cscl="sub : BEGIN END " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
use strict;
use warnings;
use autodie qw/ open close /;
use Data::Dump qw/ dd /;
my $infile = \q{
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyza
cdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzab
};
Roger( $infile );
exit( 0 );
sub Roger {
    my( $infile ) = @_;
    open my( $infh ), '<', $infile;    ## "or die..." by autodie
    my $count = 5;
    while( <$infh> ) {
        next if !/\S/;
        my @what = split //, $_;
        #~ dd( \@what );
        my $right = $count + 8;
        @what = grep defined, @what[ 0 .. 4, $count .. $right ];
        #~ dd( \@what );
        Dodger( \@what );
        $count = $right;
    } ## end while( <$infh> )
} ## end sub Roger
sub Dodger {
    my( $arref ) = @_;
    print join ',', @$arref, "\n";
} ## end sub Dodger
It seems that you have genome-wide methylation data
measured using Illumina BeadChips, and your file was
generated as a GenomeStudio final report, correct?
In this case I'd recommend considering the following
points:
- You may be better off starting not from the report
file but from the IDAT files (the original files
that went into the GenomeStudio project) and using
the appropriate R packages (methylumi, minfi,
etc.) from the Bioconductor repository to process
the data.
- Of course this requires sufficient memory. So
the machine should have at least 8 GB RAM available
(better 16)
- There are some packages that claim to be able to
load genome studio reports (methylumi). But I'm not
sure about that.
- If you still want to split the file using Perl,
the script should work for any number of columns per
individual. Thus avoid hard-coding this number in the
script, as you may discover that you are missing some
information and have to regenerate the report with
different settings and a different number of columns
per subject.
- Unfortunately, many available programs for data
processing (in particular some R functions in several
packages) are not capable of obeying i18n rules and
simply use "." as the decimal separator, or they
use the encoding inconsistently. To ensure data
integrity, it seems best to change your language
settings to English and to change all ","
in the data file to ".". So one of the first things to
do for each line after reading it in would be "tr/,/./;",
if you are using "," as the regular decimal separator,
as in your provided data file.
- Provided enough file handles are available, it
is possible to generate all output files in one
pass (just as poj's solution assumes); e.g. on
newer Windows platforms this number is practically
unlimited (nominally about 16 million per process).
If this is not the case, you'd have to do it in
several passes.
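The last three points can be combined in one sketch: derive each VP's column indices from the header line instead of hard-coding them, and normalize the decimal separator with tr/,/./ on every data line. The output names "VP<n>.txt", the sub name, and the assumption that each VP column header starts with "VPn." are mine:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a GenomeStudio-style report into one file per VP, using
# the header line to find each VP's columns (so the number of
# columns per subject is never hard-coded), and converting the
# decimal comma to a dot in all data lines.
sub split_by_header {
    my( $infile ) = @_;
    open my $in, '<', $infile or die "can't open $infile: $!";

    # Map each VP name to the indices of its columns.
    chomp( my $header = <$in> );
    my @cols = split /\t/, $header;
    my %vp_cols;
    for my $i ( 4 .. $#cols ) {
        my( $vp ) = $cols[$i] =~ /^(VP\d+)\./ or next;
        push @{ $vp_cols{$vp} }, $i;
    }

    # One output file per VP, each starting with its own header.
    my %out;
    for my $vp ( keys %vp_cols ) {
        open $out{$vp}, '>', "$vp.txt" or die "can't open $vp.txt: $!";
        print { $out{$vp} }
            join( "\t", @cols[ 0 .. 3, @{ $vp_cols{$vp} } ] ), "\n";
    }

    while ( <$in> ) {
        chomp;
        tr/,/./;    # "0,7469755" -> "0.7469755"
        my @we = split /\t/;
        print { $out{$_} }
            join( "\t", @we[ 0 .. 3, @{ $vp_cols{$_} } ] ), "\n"
            for keys %vp_cols;
    }
    close $_ for $in, values %out;
}

split_by_header( $ARGV[0] ) if @ARGV;
```

Because the column ranges come from the header, the same script keeps working if the report is regenerated with more or fewer fields per subject.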
Does poj's suggestion suit your needs?