comment on

Dear Monks, I am looking for some wisdom, I have a table with 40+ columns and 10000+ rows (bioinformaticians lol) and I have started programming in Perl few months ago. I want to write a program for performing some calculation on tables (which are in the 99% of the cases the output from various softwares). I need to able to : select the replicates per group, keep the replicates separated (es. cond A1/2/3 needs to be tested against Ctrl1/2/3 not against B1/2/3) and to perform the calculation element-wise (i.e t-test needs to be performed using the first element in the three replicate in A vs 3 replicate of ctrl so I thought the most convenient data structure for storing it would be AoA or HoA. the column names are like "somethingcostant_A1" so the first part of the string is common to every bunch of column. An example of the dataset could be found here.

https://www.dropbox.com/s/jvpqu8wlvwri8x6/proteinGroups.txt?dl=0

I thought it would be pretty cool to be able to run a program who's creating a ref table first with the name of what it needs to be taken, groups and control. so enough bla bla let's go to the code.

  
use warnings;
use strict;

chdir "\.\/IN";

my @files = glob "*txt";

foreach (@files) {
    my $infile = "$_";
    print "$infile opened\n";
    open my $fh, "<", $infile or die "can not open '$infile' $infile\n
+$!\n";

    my @hh;
    while (<$fh>) {
        chomp;
        @hh = split /\t/, $_;
        last;
    }

    close $fh or die "can not close filehandle\n";
    print "$infile closed\n";

    chdir "\.\.";

    print
      "insert column name used for calculation\ni.e MaxQuant LFQ inten
+sity\n";
    chomp( my $w = <STDIN> );

    my @lfq = grep { /^$w/ } @hh;
    @lfq = map { ( my $foo = $_ ) =~ s/^$w //; $foo } @lfq;

    print "insert name of control without replicate number\n";
    chomp( my $c = <STDIN> );

    my @crl = grep { /^$c*/ } @lfq;
    my @res = grep { !/^$c*/ } @lfq;
    push @crl, @res;

    print "insert number of groups\n";
    chomp( my $g = <STDIN> );

    print "insert number of replicates\n";
    chomp( my $l = <STDIN> );

    my $length = scalar @lfq;
    if ( $length == $g * $l ) {
        my @out;
        my $nn = join "\t", @crl;
        my @ll = ( 1 .. $g );
        my @dd = ( 1 .. $l );
        my @gg;
        for ( 1 .. $g ) {
            until ( scalar @gg == $_ * $l ) {
                push @gg, $_;
            }
        }
        my @cc;
        for ( 1 .. $length ) { push @cc, $w }

        my $hj = join "\t", @cc;
        my $jj = join "\t", @gg;
        push @out, "$hj\n";
        push @out, "$jj\n";
        unshift @out, "$nn\n";

        my $outfile = "\.\/INFO\/order.txt";
        open my $out, ">", $outfile or die "$!";
        print $out @out;
        print "results printed in $outfile\n";
        close $out or die "can not close filehandle for printing\n$!\n
+";
    }
    else {
        print
"wrong number of groups or replicates\ncheck column names for typos\n"
+;
    }

}
[download]

and this is working no problem, then this is the major program where the calculation should be performed.

 #!/Users/andreafossati/perl5/perlbrew/perls/perl-5.24.0/bin/perl 
use warnings;
use strict;
use Data::Dumper;


my $info = "\.\/INFO\/order.txt";

open my $fh, "<", $info or die "can not open 'info' txt, run order pl 
+first\n$!\n";

my (@bb, %kk, $us, @sam);

my $sw = 0;
while (<$fh>)  {

  if (!/^[1-9]{1}/ && $sw == 0 ) 
      { @sam = split /\t/, $_;
        $sw = 1; 
        next;
        }
    
    if (/^[1-9]{1}/ && ( @sam ) ) {
        @bb = split /\t/, $_;
        @kk{@sam} = @bb;
        next;
        }
   my $st = $_;
   my @ff = split /\t/, $st;
   @ff    = uniq(@ff);
   $us    = $ff[0];
}
close $fh or die "can not close 'info' fh\n$!";

my @m      = reverse sort {$a <=> $b} values %kk;
my $gr     = $m[0];
(@m, @bb)  = ();

print "\n$gr groups found, column used for quantification: $us\n";


print "\ninsert filename to be processed\nneeds to be a tab delimited 
+file in IN folder\n";
chomp (my $filename = <STDIN>) ;

my $in = "\.\/IN\/$filename.txt";

print "\ninsert prot identifier name\n";
chomp (my $w = <STDIN>);

 
open my $fh_2, "<", $in or die "can not open $filename,\n$!\n";

my (@h, @d, @id, %gg ) = ();

while (<$fh_2>) {
chomp;
my  %t = ();
if (/.*$w.*/) {@h = split /\t/, $_; next;}

if ( @h ) { 
    @d     = split /\t/, $_; 
    @t{@h} = @d;
    my $ll = $t{$w};
    push @id, $ll;   
    my @k;
    for (keys %kk)
    {
      my @cm = grep { /^$us.*/ } keys(%t); 
      push @k, @cm;
    }
    @k = uniq(@k);

    for (@k) { 
       my $key  = $_;
       (my $k   = $key ) =~ s/^$us.//;
       push @{$gg{$k}}, $t{$key};  
    }
  } else {next;}
}

close $fh_2 or die "can not close 'filename' fh\n$!";



sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}
[download]

and also this one is fine. but I am having hard times taking out the things I want from the HoA, for example for normalizing the data I need to calculate the median per array then subtract per each element the medianof the column plus the mean of the medians previously calculated. I was thinking about using PDL and converting it to a matrix (AoA) then performing everything, but I am not so expert in Perl and after reading a lot about complex data structure I m kinda confused in how to convert from HoA to AoA and I also don't know if a matrix would be the correct structure for this sort of problem. Or if I simply use R and whatever. Ideally my output would be a data structure that allows to perform element-wise calculation without loosing the order of the elements (every position for every array is a different protein and needs to be compared with he same protein in the different arrays), furthermore I also need to be able to separate the 3 control from the rest. Please note in the dataset provided I have got just six samples (3+3) usually it is at least 50 so it needs to be something really flexible. I also need to not have them hardcoded cause otherwise it will means having to change them in every different runs. Sorry for the uber long post!

In reply to Statistics Data Structure Hash of Arrays, Arrays of Array by Serial_

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.