robertkraus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks!

I am an early perl freshman. So far I do some text manipulation in a line-by-line style on plain text documents. What I'd like to parse now is a table that needs to be rearranged so that info from a certain line goes to another place at another line. This line crossing is new to me and I can't get my head around solving it...

Here is an example that I wrote to illustrate:

SNP5 IND1 A5 C5 0.8
SNP2 IND1 A2 C2 0.8
SNP1 IND1 A1 C1 0.8
SNP3 IND1 A3 C3 0.8
SNP4 IND1 A4 C4 0.8
SNP5 IND2 G5 T5 0.8
SNP2 IND2 G2 T2 0.8
SNP1 IND2 G1 T1 0.8
SNP3 IND2 G3 T3 0.8
SNP4 IND2 G4 T4 0.8

The last column (all the 0.8s) is not needed. The first column (SNP1-5) is used to sort the info in columns 3 and 4 into the correct positions (columns 3 and 4 give the possible states of each SNP per IND). Each combination of IND and SNP is unique in this data. In my output I would like to get a table with a first column giving each IND, followed by the sorted info about its state for each SNP.

This table needs to be stored as:

IND1 A1 C1 A2 C2 A3 C3 A4 C4 A5 C5
IND2 G1 T1 G2 T2 G3 T3 G4 T4 G5 T5

So I need to get from a multi-line format that stores the two possible states of IND per SNP, to a one-line format for every IND. Getting the SNPs sorted properly makes it possible to omit the name of the SNPs. They don't need to appear in the output. All I need is the right order (1-5 for each state).

My first strategy was to use hash tables to store each IND-SNP combination as key and each state combination as value. This works, but now I run into problems when I want to arrange the output table.

This is also the first time I do something with hash tables, so I am quite happy that I get my table populated already using this code:

while (<>) {
    @intarray = split('\t');
    $key = $intarray[1].",".$intarray[0];
    $value = $intarray[2].",".$intarray[3];
    $datahash{$key} = $value;
}

I use commas as separators to possibly split the keys and values apart later on. The problem is just that I don't know how to go on... I had to learn that I can't use regular expressions in calling values of a certain list of keys (like calling the values from any key matching ^IND1* in a foreach loop)... But also any other starting point to solve this reformatting challenge is highly appreciated!
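A hash lookup itself does not take a pattern, but the list of keys can be filtered with grep first and the matching keys looked up afterwards. The following is only a minimal sketch under the "IND,SNP" key layout described above; the sample hash contents and the helper name values_for are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample in the "IND,SNP" => "state1,state2" layout.
my %datahash = (
    'IND1,SNP2' => 'A2,C2',
    'IND1,SNP1' => 'A1,C1',
    'IND2,SNP1' => 'G1,T1',
);

# Return the values of all keys matching a pattern, in sorted key order.
# (values_for is an illustrative helper name, not an existing function.)
sub values_for {
    my ($pattern, $hash) = @_;
    my @keys = grep { $_ =~ $pattern } keys %$hash;
    return map { $hash->{$_} } sort @keys;
}

my @ind1 = values_for(qr/^IND1,/, \%datahash);

# Split each "state1,state2" value apart again and print one row.
print join("\t", 'IND1', map { split /,/ } @ind1), "\n";
```

This scans every key once per IND, so for large tables a nested data structure is likely the better fit.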

Replies are listed 'Best First'.
Re: Table manipulation, array or hash?
by GrandFather (Saint) on Mar 23, 2010 at 10:08 UTC

    You are part way there. But really what you want is a hash of hash of arrays. Consider:

    use strict;
    use warnings;

    my %data;

    while (<DATA>) {
        chomp;
        my ($snp, $ind, @parts) = split;
        next if @parts < 2;
        $data{$ind}{$snp} = \@parts;
    }

    for my $ind (sort keys %data) {
        print "$ind";
        for my $snp (sort keys %{$data{$ind}}) {
            my $parts = $data{$ind}{$snp};
            print "\t$parts->[0]\t$parts->[1]";
        }
        print "\n";
    }

    __DATA__
    SNP5 IND1 A5 C5 0.8
    SNP2 IND1 A2 C2 0.8
    SNP1 IND1 A1 C1 0.8
    SNP3 IND1 A3 C3 0.8
    SNP4 IND1 A4 C4 0.8
    SNP5 IND2 G5 T5 0.8
    SNP2 IND2 G2 T2 0.8
    SNP1 IND2 G1 T1 0.8
    SNP3 IND2 G3 T3 0.8
    SNP4 IND2 G4 T4 0.8

    Prints:

    IND1 A1 C1 A2 C2 A3 C3 A4 C4 A5 C5
    IND2 G1 T1 G2 T2 G3 T3 G4 T4 G5 T5
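One caveat for larger inputs: a plain sort keys compares the names as strings, so with ten or more SNPs a name like SNP10 would be placed before SNP2. Below is a minimal sketch of a numeric ordering, assuming every name is "SNP" followed by a number; snp_number and by_snp_number are hypothetical helper names, not part of the code above.

```perl
use strict;
use warnings;

# Extract the numeric part of a name like "SNP12" (assumed naming scheme).
# Names without digits fall back to 0.
sub snp_number {
    my ($name) = @_;
    return $name =~ /(\d+)/ ? $1 : 0;
}

# Sort SNP names by their number rather than as plain strings.
# Meant to be called in list context.
sub by_snp_number {
    return sort { snp_number($a) <=> snp_number($b) } @_;
}

my @sorted = by_snp_number(qw(SNP10 SNP2 SNP1));
print "@sorted\n";    # SNP1 SNP2 SNP10
```

In the inner loop this would presumably be used as for my $snp (by_snp_number(keys %{$data{$ind}})) in place of the plain sort.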

    True laziness is hard work
      Amazing! I had been sitting over this for 3 days or so already (not really full time, but most of the work day actually), and here the solution comes in a few minutes! I adapted it to fit into my workflow and it works like a dream, thanks a lot!
Re: Table manipulation, array or hash?
by biohisham (Priest) on Mar 23, 2010 at 10:09 UTC
    Think about a data structure called a Hash of Arrays.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my %hash;

    while(<DATA>){
        chomp;
        my ($SNP, $IND, $COL1, $COL2, $COL3)= split(/\s+/);
        push @{$hash{$IND}}, $COL1, $COL2 if $IND;
    }

    use Data::Dumper;
    print Dumper(\%hash);

    __DATA__
    SNP5 IND1 A5 C5 0.8
    SNP2 IND1 A2 C2 0.8
    SNP1 IND1 A1 C1 0.8
    SNP3 IND1 A3 C3 0.8
    SNP4 IND1 A4 C4 0.8
    SNP5 IND2 G5 T5 0.8
    SNP2 IND2 G2 T2 0.8
    SNP1 IND2 G1 T1 0.8
    SNP3 IND2 G3 T3 0.8
    SNP4 IND2 G4 T4 0.8

    Check the Data Structures Cookbook (perldsc).

    sorting and printing the outcome is left as an exercise to the OP
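For readers attempting that exercise: because push appends the states in input order, the SNP number has to be kept alongside each pair if the row is to be sorted later. One possible sketch, assuming SNP names of the form "SNP" plus a number; collect and rows are illustrative names, and storing [number, state1, state2] triples is a variation on the structure above, not the poster's own solution.

```perl
use strict;
use warnings;

# Build IND => [ [snp_number, state1, state2], ... ] from input lines.
sub collect {
    my %hash;
    for my $line (@_) {
        my ($snp, $ind, $col1, $col2) = split ' ', $line;
        next unless defined $col2;          # skip blank or short lines
        my ($num) = $snp =~ /(\d+)/;        # assumes names like "SNP5"
        push @{ $hash{$ind} }, [ $num, $col1, $col2 ];
    }
    return \%hash;
}

# Produce one tab-separated row per IND, with states ordered by SNP number.
sub rows {
    my ($hash) = @_;
    my @rows;
    for my $ind (sort keys %$hash) {
        my @sorted = sort { $a->[0] <=> $b->[0] } @{ $hash->{$ind} };
        push @rows, join "\t", $ind, map { ($_->[1], $_->[2]) } @sorted;
    }
    return @rows;
}

my @lines = (
    "SNP5 IND1 A5 C5 0.8",
    "SNP2 IND1 A2 C2 0.8",
    "SNP1 IND1 A1 C1 0.8",
    "SNP1 IND2 G1 T1 0.8",
);
print "$_\n" for rows(collect(@lines));
```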


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
      Thanks a lot! Looks extremely efficient, good to know that there is a module that takes over some of the work!
        Data::Dumper is a module that dumps snapshots of your variables/values. It's not really doing anything in this case, other than being used for debugging. The work being done is in the loops.
Re: Table manipulation, array or hash?
by ack (Deacon) on Mar 23, 2010 at 17:36 UTC

    I took a somewhat similar approach to the other responders and used a Hash of Hashes (HoH), combining the two data values for each entry into a string which could later be split() so as to avoid a third level of deep data structure.

    My approach (with just the principal code snippet) is shown below:

    my %myHash = ();
    my %tempHash = ();

    foreach (@lines){
        my($key1,$key2,$val1,$val2,$rest) = split(/\s+/,$_,5);
        my $combinedValue = sprintf("%2s,%2s",$val1,$val2);
        $key1 =~ /SNP(\d+)/;
        my $indx = $1;
        if(exists $myHash{$key2}){
            %tempHash = %{$myHash{$key2}};
            $tempHash{$indx} = $combinedValue;
            $myHash{$key2} = {%tempHash};
        } else {
            %tempHash = ();   # start empty so entries from the previous key do not leak in
            $tempHash{$indx} = $combinedValue;
            $myHash{$key2} = {%tempHash};
        }
    }

    foreach my $key (sort keys %myHash){
        my %tempHash2 = %{$myHash{$key}};
        my $line2output = "$key ";
        foreach my $sortedKey (sort keys %tempHash2){
            $line2output .= sprintf(" %2s %2s",split(',',$tempHash2{$sortedKey}));
        }
        print "$line2output\n";
    }

    I have also put the OP's example data input into an array, @lines, to simplify my testing. Assuming that the lines are being read in from a file, one would do a foreach (<INPUT>){} rather than my foreach (@lines){} structure.

    I hope this helps and shows yet another approach that works. I didn't spend a lot of time optimizing or simplifying. I figure that is a worthwhile exercise for the reader and the OP.

    ack Albuquerque, NM
      one would do a foreach (<INPUT>){}

      No, one wouldn't. One might do while (<$inFile>) {...}, however. Perl for loops like to work with lists of things and will generally create a list (except in a few special cases), which in the code you suggested would slurp the entire file into memory - something that should generally be avoided.
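To make the difference concrete, here is a small sketch of the while form, which reads one line per iteration instead of building a list of all lines first. The in-memory filehandle and the name count_lines are stand-ins chosen for illustration; with a real file one would open its path instead.

```perl
use strict;
use warnings;

# Count the non-empty lines on a filehandle, reading one line at a time.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {      # scalar context: one line per iteration
        chomp $line;
        next unless length $line;   # ignore empty lines
        $count++;
    }
    return $count;
}

# An in-memory filehandle stands in for a real input file here.
my $text = "SNP5 IND1 A5 C5 0.8\nSNP2 IND1 A2 C2 0.8\n";
open my $in, '<', \$text or die "Cannot open in-memory file: $!";
print count_lines($in), "\n";
close $in;
```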

      That aside, I find your sample code very 'busy' with repeated code and needless (and poorly named) variables. Contrast it with the following:

      my %dataHash;

      foreach (@lines) {
          my ($key1, $key2, $val1, $val2, $rest) = split(/\s+/, $_, 5);
          my $combinedValue = "$val1,$val2";
          $key1 =~ /SNP(\d+)/;
          $dataHash{$key2}{$1} = $combinedValue;
      }

      foreach my $key (sort keys %dataHash) {
          my %tempHash2 = %{$dataHash{$key}};
          print $key;
          printf(" %2s %2s", split(',', $tempHash2{$_})) for sort keys %tempHash2;
          print "\n";
      }

      In a teaching context it is desirable to present the cleanest code you can and to demonstrate best practices. Worthwhile exercises for the reader generally entail extending the code in various ways - not trying to compensate for the sample's deficiencies.


      True laziness is hard work