Table manipulation, array or hash?

robertkraus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks!

I am an early Perl freshman. So far I have done some text manipulation in a line-by-line style on plain text documents. What I'd like to parse now is a table that needs to be rearranged so that info from one line goes to another place on another line. This crossing of lines is new to me and I can't get my head around solving it...

Here is an example that I wrote to illustrate:

SNP5    IND1    A5    C5    0.8
SNP2    IND1    A2    C2    0.8
SNP1    IND1    A1    C1    0.8
SNP3    IND1    A3    C3    0.8
SNP4    IND1    A4    C4    0.8
SNP5    IND2    G5    T5    0.8
SNP2    IND2    G2    T2    0.8
SNP1    IND2    G1    T1    0.8
SNP3    IND2    G3    T3    0.8
SNP4    IND2    G4    T4    0.8

The last column (all the 0.8s) is not needed. The first column (SNP1-5) is used to sort the info in columns 3 and 4 into the correct positions (columns 3 and 4 give the possible states of each SNP per IND). Each combination of IND and SNP is unique in this data. In my output I would like to get a table with a first column giving each IND, followed by the sorted info about its state for each SNP.

This table needs to be stored as:

IND1    A1    C1    A2    C2    A3    C3    A4    C4    A5    C5
IND2    G1    T1    G2    T2    G3    T3    G4    T4    G5    T5

So I need to get from a multi-line format that stores the two possible states of IND per SNP, to a one-line format for every IND. Getting the SNPs sorted properly makes it possible to omit the name of the SNPs. They don't need to appear in the output. All I need is the right order (1-5 for each state).

My first strategy was to use hash tables to store each IND-SNP combination as key and each state combination as value. This works, but now I run into problems when I want to arrange the output table.

This is also the first time I do something with hash tables, so I am quite happy that I get my table populated already using this code:

while (<>) {

        @intarray = split('\t');

        $key = $intarray[1].",".$intarray[0];
        $value = $intarray[2].",".$intarray[3];

        $datahash{$key} = $value;
        }

I use commas as separators so that I can split the keys and values apart later on. The problem is just that I don't know how to go on... I had to learn that I can't use regular expressions when calling the values of a certain list of keys (like calling the values from any key matching ^IND1* in a foreach loop)... But any other starting point to solve this reformatting challenge is also highly appreciated!
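
A hash cannot be indexed by a pattern directly, but its keys can be filtered with `grep` and a regular expression, and the surviving keys then used for ordinary lookups. The following is a minimal sketch, assuming the "IND,SNP" key layout built by the loop above; `keys_matching` and the sample hash contents are hypothetical names and data chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample mirroring the "IND,SNP" => "state1,state2" layout.
my %datahash = (
    'IND1,SNP2' => 'A2,C2',
    'IND1,SNP1' => 'A1,C1',
    'IND2,SNP1' => 'G1,T1',
);

# Return the keys of the given hash (passed by reference) that match
# $pattern, sorted as plain strings.
sub keys_matching {
    my ($hash, $pattern) = @_;
    my @matches = grep { $_ =~ $pattern } keys %$hash;
    my @sorted  = sort @matches;
    return @sorted;
}

# The trailing comma in the pattern keeps "IND1" from also matching "IND10".
for my $key (keys_matching(\%datahash, qr/^IND1,/)) {
    print "$key => $datahash{$key}\n";
}
```

Filtering a flat hash this way scans every key for each IND, which is one reason a nested structure keyed first by IND tends to be more convenient.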

Replies are listed 'Best First'.
Re: Table manipulation, array or hash? by GrandFather (Saint) on Mar 23, 2010 at 10:08 UTC
You are part way there. But really what you want is a hash of hash of arrays. Consider:

use strict;
use warnings;

my %data;

while (<DATA>) {
    chomp;
    my ($snp, $ind, @parts) = split;
    next if @parts < 2;
    $data{$ind}{$snp} = \@parts;
}

for my $ind (sort keys %data) {
    print "$ind";
    for my $snp (sort keys %{$data{$ind}}) {
        my $parts = $data{$ind}{$snp};
        print "\t$parts->[0]\t$parts->[1]";
    }
    print "\n";
}

__DATA__
SNP5 IND1 A5 C5 0.8
SNP2 IND1 A2 C2 0.8
SNP1 IND1 A1 C1 0.8
SNP3 IND1 A3 C3 0.8
SNP4 IND1 A4 C4 0.8
SNP5 IND2 G5 T5 0.8
SNP2 IND2 G2 T2 0.8
SNP1 IND2 G1 T1 0.8
SNP3 IND2 G3 T3 0.8
SNP4 IND2 G4 T4 0.8

Prints:

IND1    A1    C1    A2    C2    A3    C3    A4    C4    A5    C5
IND2    G1    T1    G2    T2    G3    T3    G4    T4    G5    T5

True laziness is hard work
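
One caveat worth noting: a plain `sort keys` compares the SNP names as strings, which gives the right order for SNP1 to SNP5 but would place SNP10 before SNP2. The following is a minimal sketch of a numeric sort, assuming every SNP name ends in its number; `snp_number` and `sort_snps` are hypothetical helper names, not part of the reply above.

```perl
use strict;
use warnings;

# Extract the trailing number from a name such as "SNP10" (0 if none is found).
sub snp_number {
    my ($name) = @_;
    return $name =~ /(\d+)$/ ? $1 : 0;
}

# Sort SNP names by their numeric suffix rather than as plain strings.
sub sort_snps {
    my @names  = @_;
    my @sorted = sort { snp_number($a) <=> snp_number($b) } @names;
    return @sorted;
}

print join(' ', sort_snps(qw(SNP10 SNP2 SNP1))), "\n";   # prints: SNP1 SNP2 SNP10
```

With a hash of hashes keyed by IND and then SNP, the inner loop could iterate over `sort_snps(keys %{$data{$ind}})` instead of `sort keys %{$data{$ind}}`.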
Re^2: Table manipulation, array or hash? by robertkraus (Novice) on Mar 23, 2010 at 11:30 UTC
Amazing! I had been sitting over this for 3 days or so already (not really full time, but most of the work day actually), and here the solution comes in a few minutes! I adapted it to fit into my workflow and it works like a dream, thanks a lot!
Re: Table manipulation, array or hash? by biohisham (Priest) on Mar 23, 2010 at 10:09 UTC
Think about a data structure called a hash of arrays.

#!/usr/local/bin/perl
use strict;
use warnings;

my %hash;

while (<DATA>) {
    chomp;
    my ($SNP, $IND, $COL1, $COL2, $COL3) = split(/\s+/);
    push @{$hash{$IND}}, $COL1, $COL2 if $IND;
}

use Data::Dumper;
print Dumper(\%hash);

__DATA__
SNP5 IND1 A5 C5 0.8
SNP2 IND1 A2 C2 0.8
SNP1 IND1 A1 C1 0.8
SNP3 IND1 A3 C3 0.8
SNP4 IND1 A4 C4 0.8
SNP5 IND2 G5 T5 0.8
SNP2 IND2 G2 T2 0.8
SNP1 IND2 G1 T1 0.8
SNP3 IND2 G3 T3 0.8
SNP4 IND2 G4 T4 0.8

Check The Data Structure Cookbook. Sorting and printing the outcome is left as an exercise to the OP.

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
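
Because a hash of arrays built with `push` keeps the states in input order, the rows need to be ordered by SNP before they are pushed. The following is a minimal sketch of one way to do that, assuming each row has already been split into an array reference of the form [SNP, IND, state1, state2]; `snp_index` and `build_table` are hypothetical names used for illustration only.

```perl
use strict;
use warnings;

# Numeric suffix of an SNP name such as "SNP3" (0 if none is found).
sub snp_index {
    my ($name) = @_;
    return $name =~ /(\d+)$/ ? $1 : 0;
}

# Build IND => [state, state, ...] with the states ordered by SNP number.
# Each row is an array ref: [SNP, IND, state1, state2].
sub build_table {
    my @rows = @_;
    my %table;
    for my $row (sort { snp_index($a->[0]) <=> snp_index($b->[0]) } @rows) {
        my ($snp, $ind, @states) = @$row;
        push @{ $table{$ind} }, @states[0, 1];
    }
    return \%table;
}

my $table = build_table(
    [qw(SNP2 IND1 A2 C2)],
    [qw(SNP1 IND1 A1 C1)],
    [qw(SNP1 IND2 G1 T1)],
);

# One tab-separated line per IND.
for my $ind (sort keys %$table) {
    print join("\t", $ind, @{ $table->{$ind} }), "\n";
}
```

Sorting all rows up front means the whole input has to be held in memory, which is fine for small tables but is a trade-off compared with storing the SNP name as a second hash level.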
Re^2: Table manipulation, array or hash? by robertkraus (Novice) on Mar 23, 2010 at 11:27 UTC
Thanks a lot! Looks extremely efficient, good to know that there is a module that takes over some of the work!
Re^3: Table manipulation, array or hash? by deMize (Monk) on Mar 23, 2010 at 13:37 UTC
Data::Dumper is a module that dumps snapshots of your variables/values. It's not really doing anything in this case, other than being used for debugging. The work being done is in the loops.
Re: Table manipulation, array or hash? by ack (Deacon) on Mar 23, 2010 at 17:36 UTC
I took a somewhat similar approach to the other responders and used a Hash of Hashes (HoH), combining the two data values for each entry into a string which could later be `split()` so as to avoid a third level of data structure. My approach (with just the principal code snippet) is shown below:

my %myHash   = ();
my %tempHash = ();

foreach (@lines) {
    my ($key1, $key2, $val1, $val2, $rest) = split(/\s+/, $_, 5);
    my $combinedValue = sprintf("%2s,%2s", $val1, $val2);
    $key1 =~ /SNP(\d+)/;
    my $indx = $1;
    if (exists $myHash{$key2}) {
        %tempHash = %{$myHash{$key2}};
        $tempHash{$indx} = $combinedValue;
        $myHash{$key2} = {%tempHash};
    }
    else {
        %tempHash = ();    # start empty so entries from another IND do not leak in
        $tempHash{$indx} = $combinedValue;
        $myHash{$key2} = {%tempHash};
    }
}

foreach my $key (sort keys %myHash) {
    my %tempHash2 = %{$myHash{$key}};
    my $line2output = "$key ";
    foreach my $sortedKey (sort keys %tempHash2) {
        $line2output .= sprintf(" %2s %2s", split(',', $tempHash2{$sortedKey}));
    }
    print "$line2output\n";
}

I have also put the OP's example data input into an array, `@lines`, to simplify my testing. Assuming that the lines are being read in from a file, one would do a `foreach (<INPUT>){}` rather than my `foreach (@lines){}` structure.

I hope this helps and shows yet another approach that works. I didn't spend a lot of time optimizing or simplifying. I figure that is a worthwhile exercise for the reader and the OP.

ack Albuquerque, NM
Re^2: Table manipulation, array or hash? by GrandFather (Saint) on Mar 23, 2010 at 20:14 UTC
"one would do a `foreach (<INPUT>){}`"

No, one wouldn't. One might do `while (<$inFile>) {...}`, however. Perl for loops like to work with lists of things and will generally create a list (except in a few special cases), which in the code you suggested would slurp the entire file into memory, something that should generally be avoided.

That aside, I find your sample code very 'busy', with repeated code and needless (and poorly named) variables. Contrast it with the following:

my %dataHash;

foreach (@lines) {
    my ($key1, $key2, $val1, $val2, $rest) = split(/\s+/, $_, 5);
    my $combinedValue = "$val1,$val2";
    $key1 =~ /SNP(\d+)/;
    $dataHash{$key2}{$1} = $combinedValue;
}

foreach my $key (sort keys %dataHash) {
    my %tempHash2 = %{$dataHash{$key}};
    print $key;
    printf(" %2s %2s", split(',', $tempHash2{$_})) for sort keys %tempHash2;
    print "\n";
}

In a teaching context it is desirable to present the cleanest code you can and to demonstrate best practices. Worthwhile exercises for the reader generally entail extending the code in various ways, not trying to compensate for the sample's deficiencies.

True laziness is hard work
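
The difference between the two loop forms can be sketched as follows: a `while` loop over a lexical filehandle reads one line per iteration, whereas `foreach (<$fh>)` evaluates the read in list context and loads every line first. This is a minimal sketch only; `count_records` is a hypothetical helper, and an in-memory filehandle stands in for a real input file so the example is self-contained.

```perl
use strict;
use warnings;

# Count non-empty lines, reading one line at a time from the filehandle.
sub count_records {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        $count++;
    }
    return $count;
}

# An in-memory "file" used here instead of a file on disk (an assumption
# made to keep the sketch runnable without external input).
my $text = "SNP5 IND1 A5 C5 0.8\nSNP2 IND1 A2 C2 0.8\n";
open my $in_file, '<', \$text or die "Cannot open in-memory file: $!";
print count_records($in_file), "\n";   # prints: 2
close $in_file;
```

For a real file, the `open` call would take a path in place of `\$text`; the loop itself stays the same.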