sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am reading a tab-delimited file with 6 columns into a hash and trying to access each element. In two of the columns the names repeat over and over, so they are not unique. But the code does not work at the following lines:
foreach my $affyscore ( sort keys %{$row{$origin{$pip{$probeset_id}}}}){
    print "$affyscore\t";
    foreach my $gc ( sort keys %{$row{$origin{$pip{$probeset_id{$affyscore}}}}}){
    . . .
The actual program is here: please forget ARGV0 and ARGV2
#!/usr/bin/perl -w
use strict;
use warnings;

######################################################################
#                   Open Input and Output Files                      #
######################################################################
if( @ARGV < 1){
    print " Enter input file\n";
    exit 0;
}
open(INPUT2,$ARGV[0]) || die "Cannot open file \"$ARGV[0]\""; # Analysis file

my @data =();
my %row =();
while(<INPUT2>){
    chomp;
    @data=split (/\t/,$_);
    my $probeset_id = $data[0];
    my $origin = $data[1];
    my $probeseq = $data[2];
    my $pip = $data[3];
    my $gc = $data[4];
    my $affyscore = $data[5];
    $row{$origin}{$pip}{$probeset_id}{$affyscore}{$gc} =$probeseq;
}
print "Fished establishing hashs.\n";

foreach my $origin(sort keys %row) {
    print "\n$origin\t";
    foreach my $pip (sort keys %{$row{$origin}}){
        print "$pip\t";
        foreach my $probeset_id (sort keys %{$row{$origin{$pip}}}){
            print "$probeset_id\t";
            foreach my $affyscore ( sort keys %{$row{$origin{$pip{$probeset_id}}}}){
                print "$affyscore\t";
                foreach my $gc ( sort keys %{$row{$origin{$pip{$probeset_id{$affyscore}}}}}){
                    print "$gc\t";
                    foreach my $probeseq ( sort keys %{$row{$origin{$pip{$probeset_id{$affyscore{$gc}}}}}}){
                        print "$probeseq\t";
                        print "\n";
                    }
                }
            }
        }
    }
}
close (INPUT2);

Replies are listed 'Best First'.
Re: Complex Data Structure
by GrandFather (Saint) on Sep 14, 2008 at 22:17 UTC

    Show us a small sample of your data. Your current structure looks very unlikely to be best. If all you want to be able to do is sort the data in some fashion then consider:

    use strict;
    use warnings;

    my $data = <<DATA;
    1,2,3,4
    2,3,4,1
    3,4,4,2
    4,1,2,3
    DATA

    my @rows;

    open my $inFile, '<', \$data;
    while (<$inFile>) {
        chomp;
        push @rows, [split ','];
    }
    close $inFile;

    print join (',', @$_), "\n" for sort mySort @rows;

    sub mySort {
        return $a->[1] <=> $b->[1]
            || $a->[3] <=> $b->[3]
            || $a->[2] <=> $b->[2]
            || $a->[0] <=> $b->[0];
    }

    Prints:

    4,1,2,3
    1,2,3,4
    2,3,4,1
    3,4,4,2

    Which uses an AoA, sort and a user defined sort sub to get the rows sorted by arbitrary columns.


    Perl reduces RSI - it saves typing
      Thank you, GrandFather. I have included a sample of my data and a description of what I really want to do in my response to hexcoder. Your help and thoughts are appreciated. Pedro
Re: Complex Data Structure
by hexcoder (Curate) on Sep 14, 2008 at 22:13 UTC
    foreach my $probeset_id (sort keys %{$row{$origin{$pip}}}){
    This construct will not work as you intended. There is no hash named %origin, which is what $origin{$pip} refers to. You constructed a hash whose values are references to anonymous hashes (which have no entry in the symbol table and can only be accessed through their references). That's fine; you only need to access them through references.

    Since you have a hash of hashes (of hashes...) better write it like this

    foreach my $probeset_id (sort keys %{$row{$origin}{$pip}}){
    You can retrieve the values the same way you assigned them. When assigning the values you made use of "autovivification". That is, you relied on Perl to create all the intermediate hash entries you referenced. Without "autovivification" your line
    $row{$origin}{$pip}{$probeset_id}{$affyscore}{$gc} =$probeseq;
    would have been written like this
    $row{$origin} = {}; # set value to an empty hash reference
    $row{$origin}{$pip} = {};
    $row{$origin}{$pip}{$probeset_id} = {};
    $row{$origin}{$pip}{$probeset_id}{$affyscore} = {};
    $row{$origin}{$pip}{$probeset_id}{$affyscore}{$gc} =$probeseq;
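    To watch autovivification happen, here is a minimal sketch. The key names are made up for illustration and are not taken from the original data. It also shows a common caveat: looking below a missing key creates the levels above it.

```perl
use strict;
use warnings;

my %row;

# One assignment creates every intermediate hash on the way down.
$row{originA}{53}{probe1} = 'ACGT';

print ref($row{originA}), "\n";        # an anonymous hash was created here
print ref($row{originA}{53}), "\n";    # and another one here

# Caveat: merely testing a key below a missing level creates that level.
my $found   = exists $row{originB}{99};
my $created = exists $row{originB};
print $created ? "originB was created\n" : "originB is absent\n";
```

    If the side effect matters, test each level in turn (exists $row{originB} && exists $row{originB}{99}) before descending.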
      Thank you very much, hexcoder and others, for the prompt responses. Let's go back to the root of the problem and see why I ended up constructing this weird structure. I have a file like the following:
      CLS_S3_Contig100_st CLS_S3_Contig100 53 10 0.3717
      CLS_S3_Contig100_at CLS_S3_Contig100 55 11 0.4321
      CLS_S3_Contig100_st CLS_S3_Contig100 57 10 0.3223
      CLS_S3_Contig100_at CLS_S3_Contig100 59 11 0.4055
      CLS_S3_Contig100_st CLS_S3_Contig100 61 11 0.4511
      CLS_S3_Contig100_at CLS_S3_Contig100 63 11 0.474
      . . .
      CLS_S3_Contig10031_st CLS_S3_Contig10031 53 12 0.5548
      CLS_S3_Contig10031_st CLS_S3_Contig10031 57 10 0.4871
      CLS_S3_Contig10031_st CLS_S3_Contig10031 61 12 0.547
      CLSS3627.b1_F19.ab1 CLS_S3_Contig10031 62 11 0.5129
      CLSS3627.b1_F19.ab1 CLS_S3_Contig10031 64 11 0.5789
      Each field is tab separated. The second column (let's call it origin) and the third column (let's call it PIP) are the ones that matter to me. As you can see, the third column jumps by one or more units from one row to the next. What I want to do is create a file with the same columns plus an extra column holding 0 or 1, which I will explain. For each origin (2nd column) I want PIP to run from 1 to its maximum in the file, starting again from 1 at each change of origin. For example, for Contig100 the max is 63, so I will have a hash with multiple values whose key is Contig100 and whose values go from 1 to 63. This should be done at each origin change. Once we have this hash, I need to compare it with my file, which is already open. If Contig100 with PIP=63 exists in the file, put 1 in the new column; if not, put 0. Therefore for PIP 1 to 52 I will have all 0's, then 1's at the PIPs that are present in the file (53, 55, 57 and so on up to 63). Here is the code that I wrote.
      my %hash_F2_1 =(); my %hash_F2_2 =(); my %hash_F2_3 = ();my %hash_F2_4 = ();
      while(<INPUT2>){
          (my $probeset_id, my $origin, my $probeseq, my $pip, my $gc, my $affyscore) = split("\t", $_);
          push (@{$hash_F2_1{$origin}}, $pip); # This makes a hash of multiple value for each probeset ID
      }
      #calculate the max and put into hash (origin and max of pip)
      for (sort keys %hash_F2_1){
          #print RESULTS "$_", "\t", max(@{$hash_F2_1{$_}}), "\n";
          $hash_F2_2{$_} = max(@{$hash_F2_1{$_}});
      }
      # then go through $hash_F2_2 and that has the key (origin)
      # and max as value and make a multivalue hash as follows:
      for my $k (sort keys %hash_F2_2){
          my $v = $hash_F2_2{$k};
          #print RESULTS "$k\t$v\n";
          for (my $i=1; $i <= $v; $i++){
              push (@{$hash_F2_3{$k}}, $i);
              # this hash (%hash_F2_3) is the hash of origins and PIPs from 1 to max
          }
      }
      LOOP1: foreach my $key(sort keys %hash_F2_3){
          LOOP2: foreach my $position (@{$hash_F2_3{$key}}){
              LOOP3: foreach my $key1(sort keys %hash_F2_1){
                  LOOP4: foreach my $position1(@{$hash_F2_1{$key1}}){
                      if ($key =~ m/$key1/ && $position==$position1 ) {
                          print "$key\t$position\t1\n";
                      }
                      else {
                          print "$key\t$position\t0\n";
                      }
                  }
              }
          }
      }
      This code does not work properly, although my rationale seems to be OK.

        If there are only two "interesting" columns, ignore the rest. Consider:

        use strict;
        use warnings;

        my $data = <<DATA;
        CLS_S3_Contig100_st,CLS_S3_Contig100,53,10,0.3717
        CLS_S3_Contig100_at,CLS_S3_Contig100,55,11,0.4321
        CLS_S3_Contig100_st,CLS_S3_Contig100,57,10,0.3223
        CLS_S3_Contig100_at,CLS_S3_Contig100,59,11,0.4055
        CLS_S3_Contig100_st,CLS_S3_Contig100,61,11,0.4511
        CLS_S3_Contig100_at,CLS_S3_Contig100,63,11,0.474
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,53,12,0.5548
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,57,10,0.4871
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,61,12,0.547
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,62,11,0.5129
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,64,11,0.5789
        DATA

        my %origins;
        my $numColumns;

        open my $inFile, '<', \$data;
        while (<$inFile>) {
            chomp;
            next unless length;
            my @columns = split ',';
            $numColumns ||= @columns; # Assume first row has correct column count
            $origins{$columns[1]}[$columns[2] - 1] = \@columns;
        }
        close $inFile;

        for my $oKey (sort keys %origins) {
            my $origin = $origins{$oKey};

            for my $pip (0 .. $#$origin) {
                my $row = $origin->[$pip];

                if (defined $row) {
                    # pip exists in original file
                    print join (",", @$row, '1'), "\n";
                } else {
                    # pip doesn't exist in original file
                    print ",$oKey,", $pip + 1, ',' x ($numColumns - 2), "0\n";
                }
            }
        }

        Prints (with large middle portion skipped):

        ,CLS_S3_Contig100,1,,,0
        ,CLS_S3_Contig100,2,,,0
        ,CLS_S3_Contig100,3,,,0
        ,CLS_S3_Contig100,4,,,0
        ,CLS_S3_Contig100,5,,,0
        ...
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,57,10,0.4871,1
        ,CLS_S3_Contig10031,58,,,0
        ,CLS_S3_Contig10031,59,,,0
        ,CLS_S3_Contig10031,60,,,0
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,61,12,0.547,1
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,62,11,0.5129,1
        ,CLS_S3_Contig10031,63,,,0
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,64,11,0.5789,1

        which demonstrates what I understand you to want.

        The key points are using a HoAoA where the hash is keyed by the origin and the array is indexed by PIP (- 1). Note that Perl generates the missing array elements, but sets them to undef so you can test for defined to see if you encountered the PIP in the original file.
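        The undef-filling behaviour can be seen in isolation with a small sketch like this one. The array name and values are invented for illustration.

```perl
use strict;
use warnings;

my @pips;
$pips[2] = 'row for PIP 3';   # assigning to index 2 extends the array to 3 elements
$pips[4] = 'row for PIP 5';   # indices 0, 1 and 3 stay undef

for my $i (0 .. $#pips) {
    # defined() tells a real entry apart from a gap
    my $flag = defined $pips[$i] ? 1 : 0;
    print join("\t", $i + 1, $flag), "\n";
}
```

        One trade-off to keep in mind: the array is as long as the largest PIP, so very large, sparse PIP values would be cheaper to hold in a hash keyed by PIP.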


        Perl reduces RSI - it saves typing

        Your sample data says you have five fields but your code says you have six fields?   So which is it?

        # then go through $hash_F2_2 and that has the key (origin)
        # and max as value and make a multivalue hash as follows:
        for my $k (sort keys %hash_F2_2){
            my $v = $hash_F2_2{$k};
            #print RESULTS "$k\t$v\n";
            for (my $i=1; $i <= $v; $i++){
                push (@{$hash_F2_3{$k}}, $i);
                # this hash (%hash_F2_3) is the hash of origins and PIPs from 1 to max
            }
        }

        You don't need the inner foreach loop, you can just use the range operator.   And if you remove the loop then you don't need to use push either.   And you don't really need the %hash_F2_2 hash either:

        # then go through $hash_F2_1 and that has the key (origin)
        # and make a multivalue hash as follows:
        my %hash_F2_3;
        for my $k ( sort keys %hash_F2_1 ) {
            #print RESULTS "$k\t$hash_F2_1{ $k }\n";
            @{ $hash_F2_3{ $k } } = 1 .. max @{ $hash_F2_1{ $k } };
            # this hash (%hash_F2_3) is the hash of origins and PIPs from 1 to max
        }
Re: Complex Data Structure
by oko1 (Deacon) on Sep 14, 2008 at 22:47 UTC

    Here's (the beginning of) your problem:

    foreach my $probeset_id (sort keys %{$row{$origin{$pip}}}){

    In the above, you're saying "return the value for key $pip from hash %origin; use that value as a key in hash %row" and so on. Notice the "%origin" in this description? You don't have one. Instead, you have a scalar called $origin - and what you actually mean is

    foreach my $probeset_id (sort keys %{$row{$origin}{$pip}}){

    ...and so on. The other error you have - a minor one, in comparison to the above - is that you've dug down one level too deep in your last 'foreach' statement:

    foreach my $probeseq ( sort keys %{$row{$origin}{$pip}{$probeset_id}{$affyscore}{$gc}}){

    If you look back to where you construct that hash, you'll notice that $row{$origin}{$pip}{$probeset_id}{$affyscore}{$gc} is actually a scalar, not a hash. Trying to extract keys from it, as you do in the above statement, is not likely to be productive. Simply print the value of this variable.
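    Putting both fixes together, the traversal might look like the sketch below. The two sample records are invented for illustration, the column order follows the original program, and PIP is sorted numerically here rather than as a string, which is a small change from the original.

```perl
use strict;
use warnings;

# Build the same nested structure as the original program from two made-up records.
my %row;
for my $line ("p1\toriginA\tACGT\t53\t10\t0.37", "p2\toriginA\tTTGA\t55\t11\t0.43") {
    my ($probeset_id, $origin, $probeseq, $pip, $gc, $affyscore) = split /\t/, $line;
    $row{$origin}{$pip}{$probeset_id}{$affyscore}{$gc} = $probeseq;
}

my @lines;
for my $origin (sort keys %row) {
    # PIP is numeric, so sort it numerically (the original sorted it as a string)
    for my $pip (sort { $a <=> $b } keys %{ $row{$origin} }) {
        for my $probeset_id (sort keys %{ $row{$origin}{$pip} }) {
            for my $affyscore (sort keys %{ $row{$origin}{$pip}{$probeset_id} }) {
                my $by_gc = $row{$origin}{$pip}{$probeset_id}{$affyscore};
                for my $gc (sort keys %$by_gc) {
                    # The innermost value is a plain scalar: use it, do not iterate over it.
                    push @lines, join "\t", $origin, $pip, $probeset_id, $affyscore, $gc, $by_gc->{$gc};
                }
            }
        }
    }
}
print "$_\n" for @lines;
```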

    Last of all - this isn't a mistake, but it is likely to lead to one (a mistype):

    @data=split (/\t/,$_);
    my $probeset_id = $data[0];
    my $origin = $data[1];
    my $probeseq = $data[2];
    my $pip = $data[3];
    my $gc = $data[4];
    my $affyscore = $data[5];

    This is unnecessary. Better to use a hash slice:

    my ($probeset_id, $origin, $probeseq, $pip, $gc, $affyscore) = split /\t/;
    
    -- 
    Human history becomes more and more a race between education and catastrophe. -- HG Wells
    
      my ($probeset_id, $origin, $probeseq, $pip, $gc, $affyscore) = split /\t/;

      is a list assignment. A hash slice assignment looks like:

      @hash{qw(this that the other)} = qw(some value or another);
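      For comparison, here is a short sketch with both forms side by side, applied to one made-up record. The hash name %record and the sample line are assumptions for illustration.

```perl
use strict;
use warnings;

my $line = "p1\toriginA\tACGT\t53\t10\t0.37";   # made-up record

# List assignment: each field lands in its own scalar.
my ($probeset_id, $origin, $probeseq, $pip, $gc, $affyscore) = split /\t/, $line;

# Hash slice assignment: the same fields land in one hash, keyed by name.
my %record;
@record{qw(probeset_id origin probeseq pip gc affyscore)} = split /\t/, $line;

print "$origin $record{origin}\n";
```

      The list assignment suits a handful of fields used right away; the hash slice keeps a row together when it has to be stored or passed around.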

      Perl reduces RSI - it saves typing

        Err... trying to do too many things at once. I started by demonstrating a hash slice, realized that it wasn't the best solution, changed the code, and forgot to change the comments. Thanks for catching it!

        
        -- 
        Human history becomes more and more a race between education and catastrophe. -- HG Wells
        
      Hi oko1, thanks for the prompt response. Please see my response to GrandFather for what I really meant to do. Pedro
Re: Complex Data Structure
by injunjoel (Priest) on Sep 15, 2008 at 18:23 UTC
    Greetings,
    Here are two modules that I would suggest getting familiar with: Data::Dumper and Dumpvalue. Either one will help you visualize complex data structures. Learn them, love them, use them...
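    As a minimal sketch of the first of these, the snippet below dumps a structure shaped like the one in the original question. The record is made up for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;   # fixed, compact indentation

my %row;
$row{originA}{53}{probe1}{'0.37'}{10} = 'ACGT';   # made-up record

# Pass a reference so the whole hash is dumped as one structure.
my $dump = Dumper(\%row);
print $dump;
```

    Dumping the structure right after it is built makes it easy to see how many levels there are and which of them hold plain scalars.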

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forgo their use." -Galileo