Serial_ has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks, I am looking for some wisdom, I have a table with 40+ columns and 10000+ rows (bioinformaticians lol) and I have started programming in Perl few months ago. I want to write a program for performing some calculation on tables (which are in the 99% of the cases the output from various softwares). I need to able to : select the replicates per group, keep the replicates separated (es. cond A1/2/3 needs to be tested against Ctrl1/2/3 not against B1/2/3) and to perform the calculation element-wise (i.e t-test needs to be performed using the first element in the three replicate in A vs 3 replicate of ctrl so I thought the most convenient data structure for storing it would be AoA or HoA. the column names are like "somethingcostant_A1" so the first part of the string is common to every bunch of column. An example of the dataset could be found here.
https://www.dropbox.com/s/jvpqu8wlvwri8x6/proteinGroups.txt?dl=0I thought it would be pretty cool to be able to run a program who's creating a ref table first with the name of what it needs to be taken, groups and control. so enough bla bla let's go to the code.
use warnings; use strict; chdir "\.\/IN"; my @files = glob "*txt"; foreach (@files) { my $infile = "$_"; print "$infile opened\n"; open my $fh, "<", $infile or die "can not open '$infile' $infile\n +$!\n"; my @hh; while (<$fh>) { chomp; @hh = split /\t/, $_; last; } close $fh or die "can not close filehandle\n"; print "$infile closed\n"; chdir "\.\."; print "insert column name used for calculation\ni.e MaxQuant LFQ inten +sity\n"; chomp( my $w = <STDIN> ); my @lfq = grep { /^$w/ } @hh; @lfq = map { ( my $foo = $_ ) =~ s/^$w //; $foo } @lfq; print "insert name of control without replicate number\n"; chomp( my $c = <STDIN> ); my @crl = grep { /^$c*/ } @lfq; my @res = grep { !/^$c*/ } @lfq; push @crl, @res; print "insert number of groups\n"; chomp( my $g = <STDIN> ); print "insert number of replicates\n"; chomp( my $l = <STDIN> ); my $length = scalar @lfq; if ( $length == $g * $l ) { my @out; my $nn = join "\t", @crl; my @ll = ( 1 .. $g ); my @dd = ( 1 .. $l ); my @gg; for ( 1 .. $g ) { until ( scalar @gg == $_ * $l ) { push @gg, $_; } } my @cc; for ( 1 .. $length ) { push @cc, $w } my $hj = join "\t", @cc; my $jj = join "\t", @gg; push @out, "$hj\n"; push @out, "$jj\n"; unshift @out, "$nn\n"; my $outfile = "\.\/INFO\/order.txt"; open my $out, ">", $outfile or die "$!"; print $out @out; print "results printed in $outfile\n"; close $out or die "can not close filehandle for printing\n$!\n +"; } else { print "wrong number of groups or replicates\ncheck column names for typos\n" +; } }
and this is working no problem, then this is the major program where the calculation should be performed.
#!/Users/andreafossati/perl5/perlbrew/perls/perl-5.24.0/bin/perl use warnings; use strict; use Data::Dumper; my $info = "\.\/INFO\/order.txt"; open my $fh, "<", $info or die "can not open 'info' txt, run order pl +first\n$!\n"; my (@bb, %kk, $us, @sam); my $sw = 0; while (<$fh>) { if (!/^[1-9]{1}/ && $sw == 0 ) { @sam = split /\t/, $_; $sw = 1; next; } if (/^[1-9]{1}/ && ( @sam ) ) { @bb = split /\t/, $_; @kk{@sam} = @bb; next; } my $st = $_; my @ff = split /\t/, $st; @ff = uniq(@ff); $us = $ff[0]; } close $fh or die "can not close 'info' fh\n$!"; my @m = reverse sort {$a <=> $b} values %kk; my $gr = $m[0]; (@m, @bb) = (); print "\n$gr groups found, column used for quantification: $us\n"; print "\ninsert filename to be processed\nneeds to be a tab delimited +file in IN folder\n"; chomp (my $filename = <STDIN>) ; my $in = "\.\/IN\/$filename.txt"; print "\ninsert prot identifier name\n"; chomp (my $w = <STDIN>); open my $fh_2, "<", $in or die "can not open $filename,\n$!\n"; my (@h, @d, @id, %gg ) = (); while (<$fh_2>) { chomp; my %t = (); if (/.*$w.*/) {@h = split /\t/, $_; next;} if ( @h ) { @d = split /\t/, $_; @t{@h} = @d; my $ll = $t{$w}; push @id, $ll; my @k; for (keys %kk) { my @cm = grep { /^$us.*/ } keys(%t); push @k, @cm; } @k = uniq(@k); for (@k) { my $key = $_; (my $k = $key ) =~ s/^$us.//; push @{$gg{$k}}, $t{$key}; } } else {next;} } close $fh_2 or die "can not close 'filename' fh\n$!"; sub uniq { my %seen; grep !$seen{$_}++, @_; }
and also this one is fine. but I am having hard times taking out the things I want from the HoA, for example for normalizing the data I need to calculate the median per array then subtract per each element the medianof the column plus the mean of the medians previously calculated. I was thinking about using PDL and converting it to a matrix (AoA) then performing everything, but I am not so expert in Perl and after reading a lot about complex data structure I m kinda confused in how to convert from HoA to AoA and I also don't know if a matrix would be the correct structure for this sort of problem. Or if I simply use R and whatever. Ideally my output would be a data structure that allows to perform element-wise calculation without loosing the order of the elements (every position for every array is a different protein and needs to be compared with he same protein in the different arrays), furthermore I also need to be able to separate the 3 control from the rest. Please note in the dataset provided I have got just six samples (3+3) usually it is at least 50 so it needs to be something really flexible. I also need to not have them hardcoded cause otherwise it will means having to change them in every different runs. Sorry for the uber long post!
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Statistics Data Structure Hash of Arrays, Arrays of Array
by BrowserUk (Patriarch) on Nov 10, 2016 at 11:05 UTC | |
by Serial_ (Novice) on Nov 10, 2016 at 12:32 UTC | |
by AnomalousMonk (Archbishop) on Nov 10, 2016 at 14:36 UTC | |
by Anonymous Monk on Nov 10, 2016 at 23:25 UTC | |
by AnomalousMonk (Archbishop) on Nov 11, 2016 at 05:05 UTC | |
|
Re: Statistics Data Structure Hash of Arrays, Arrays of Array
by BrowserUk (Patriarch) on Nov 11, 2016 at 00:51 UTC | |
by etj (Priest) on May 19, 2022 at 16:16 UTC |