Renyulb28 has asked for the wisdom of the Perl Monks concerning the following question:

For all you monks out there, this question might seem trivial, but it has stumped my colleagues and me for a while. We have just received a collection of data back, and need to format it to be used with another program. The data is formatted as a 1540132 X 5 matrix. There are 142 samples, and 10846 marker measurements for each sample. Thus, 142 X 10846 = 1540132 lines. The lines are set up in this way: there are 10846 groups (one per marker), and each group lists the samples from 1 - 142. Column 1 is the sample ID, column 2 is the marker ID, column 3 is unimportant, and columns 4 and 5 are the two observations for each sample at that marker. Thus, it looks like

JL0001 Cpn_1054417303 420864 C C
JL0002 Cpn_1054417303 420864 C C
JL0003 Cpn_1054417303 420864 C C
JL0004 Cpn_1054417303 420864 C C
JL0005 Cpn_1054417303 420864 C C
JL0006 Cpn_1054417303 420864 C C
JL0007 Cpn_1054417303 420864 C C
JL0008 Cpn_1054417303 420864 C C
JL0009 Cpn_1054417303 420864 C C
JL0010 Cpn_1054417303 420864 C C
JL0011 Cpn_1054417303 420864 C C
JL0012 Cpn_1054417303 420864 C C
JL0013 Cpn_1054417303 420864 C C
JL0014 Cpn_1054417303 420864 C C
JL0015 Cpn_1054417303 420864 C C
JL0016 Cpn_1054417303 420864 C C

What we wish to do is take the observations in columns 4 and 5 from each subsequent group of 142 lines and append them to the right of the matching sample's row from the first group. The final file should keep only column 1 (the sample ID), followed by all of the column 4 and 5 values one after another. The final matrix should be 142 X 21693 (samples X (markers*2 + 1)).

JL0001 C C C C C C C C ...
JL0002 C C C C C C C C ...
JL0003 C C C C C C C C ...
JL0004 C C C C C C C C ...
JL0005 C C C C C C C C ...
JL0006 C C C C C C C C ...
JL0007 C C C C C C C C ...
JL0008 C C C C C C C C ...
JL0009 C C C C C C C C ...
I'd greatly appreciate anyone's help, as you would be doing a great deed for a group in need.

Re: High Density Data Aid - swapping specific combination of lines/columns repeatedly
by wind (Priest) on Apr 13, 2011 at 18:42 UTC
    If you have enough memory, just put all your data in a hash of arrays.
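    # $fh is assumed to be a filehandle already opened on the data file
    my %hash;
    while (<$fh>) {
        chomp;
        my @data = split /\s+/;
        # key on the sample ID (column 1) and append columns 4 and 5
        push @{$hash{$data[0]}}, @data[3,4];
    }
    # one output line per sample: the ID followed by all of its observations
    for my $key (sort keys %hash) {
        print join(' ', $key, @{$hash{$key}}), "\n";
    }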
      With that method, how can I specify which lines should have their columns 4 and 5 moved to the right of the previous set of 142 samples?

        By move to the right, I take it you mean append each new measurement to the appropriate sample's list.

        Given that each row contains the sample's name, we don't need to keep track of the fact that there are exactly 142 samples. Instead, this just adds each subsequent measurement to that sample's array of values. And at the end we write out all the results associated with each sample.
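
To make the appending behaviour concrete, here is a minimal sketch of the same hash-of-arrays idea run on a tiny made-up input (2 samples X 2 markers). The helper name reshape_rows, the marker names MarkerA/MarkerB and the sample values are hypothetical and only for illustration. It also records the order in which sample IDs are first seen, which is an assumption about the desired output order rather than something the thread specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: read long-format rows (sample, marker, unused, obs1, obs2)
# from a filehandle and return one output line per sample.
sub reshape_rows {
    my ($fh) = @_;
    my (%calls, @order);
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;              # skip blank lines
        my ($sample, $marker, $unused, @obs) = split ' ', $line;
        # remember first-seen order so output follows the input's sample order
        push @order, $sample unless exists $calls{$sample};
        # append this marker's two observations to the sample's list
        push @{ $calls{$sample} }, @obs[0, 1];
    }
    return map { join ' ', $_, @{ $calls{$_} } } @order;
}

# Made-up demo data, not taken from the real file
my $input = <<'END';
JL0001 MarkerA 420864 C C
JL0002 MarkerA 420864 C T
JL0001 MarkerB 420990 G G
JL0002 MarkerB 420990 A G
END

open my $in, '<', \$input or die "Cannot open in-memory input: $!";
my @rows = reshape_rows($in);
close $in;

print "$_\n" for @rows;
```

Each sample ends up on a single line with its observations in the order the markers appeared in the input, so the group size of 142 never has to be given explicitly. For the real data, the in-memory filehandle would be replaced by one opened on the actual file.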