in reply to Re: Complex Data Structure
in thread Complex Data Structure

Thank you very much, hexcoder and others, for the prompt responses. Let's go back to the root of the problem and see why I ended up constructing this weird structure. I have a file like the following:

    CLS_S3_Contig100_st    CLS_S3_Contig100    53  10  0.3717
    CLS_S3_Contig100_at    CLS_S3_Contig100    55  11  0.4321
    CLS_S3_Contig100_st    CLS_S3_Contig100    57  10  0.3223
    CLS_S3_Contig100_at    CLS_S3_Contig100    59  11  0.4055
    CLS_S3_Contig100_st    CLS_S3_Contig100    61  11  0.4511
    CLS_S3_Contig100_at    CLS_S3_Contig100    63  11  0.474
    . . .
    CLS_S3_Contig10031_st  CLS_S3_Contig10031  53  12  0.5548
    CLS_S3_Contig10031_st  CLS_S3_Contig10031  57  10  0.4871
    CLS_S3_Contig10031_st  CLS_S3_Contig10031  61  12  0.547
    CLSS3627.b1_F19.ab1    CLS_S3_Contig10031  62  11  0.5129
    CLSS3627.b1_F19.ab1    CLS_S3_Contig10031  64  11  0.5789

Each field is tab-separated. The second column (let's call it origin) and the third column (let's call it PIP) are the ones that matter to me. As you can see, the third column jumps by one or more units from one row to the next. What I want to do is create a file with the same columns plus an extra column holding 0 or 1. I will explain what 0 and 1 mean.
For each origin (2nd column) I want PIP to run from 1 to its max in the file, starting again from 1 at each origin change. For example, for Contig100 the max is 63, so I will have a multi-valued hash whose key is Contig100 and whose values go from 1 to 63. Now that we have this hash, I need to compare it with my file, which is already open. If a given origin and PIP exist in the file (say Contig100 and PIP=63), then put 1 in the new column; if not, put 0. Therefore for PIP 1 to 52 I will have all 0's, then at 53, 54, 55 to 63 I will have 1's, and so on. Here is the code that I wrote:

    my %hash_F2_1 = ();
    my %hash_F2_2 = ();
    my %hash_F2_3 = ();
    my %hash_F2_4 = ();

    while(<INPUT2>){
        (my $probeset_id, my $origin, my $probeseq, my $pip, my $gc, my $affyscore) = split("\t", $_);
        push (@{$hash_F2_1{$origin}}, $pip);  # This makes a hash of multiple value for each probeset ID
    }

    #clculate the max and put into hash (origin and max of pip)
    for (sort keys %hash_F2_1){
        #print RESULTS "$_", "\t", max(@{$hash_F2_1{$_}}), "\n";
        $hash_F2_2{$_} = max(@{$hash_F2_1{$_}});
    }

    # then go thorough $hash_F2_2 and that has the key (origin)
    # and max as value and make a multivalue hash as follows:
    for my $k (sort keys %hash_F2_2){
        my $v = $hash_F2_2{$k};
        #print RESULTS "$k\t$v\n";
        for (my $i=1; $i <= $v; $i++){
            push (@{$hash_F2_3{$k}}, $i);
            # this hash (%hash_F2_3) is the hash of origins and PIPs from 1 to max
        }
    }

    LOOP1: foreach my $key(sort keys %hash_F2_3){
        LOOP2: foreach my $position (@{$hash_F2_3{$key}}){
            LOOP3: foreach my $key1(sort keys %hash_F2_1){
                LOOP4: foreach my $position1(@{$hash_F2_1{$key1}}){
                    if ($key =~ m/$key1/ && $position==$position1 ) {
                        print "$key\t$position\t1\n";
                    }
                    else {
                        print "$key\t$position\t0\n";
                    }
                }
            }
        }
    }

This code does not work properly, although my rationale seems to be OK.

Re^3: Complex Data Structure
by GrandFather (Saint) on Sep 15, 2008 at 01:51 UTC

    If there are only two "interesting" columns, ignore the rest. Consider:

        use strict;
        use warnings;

        my $data = <<DATA;
        CLS_S3_Contig100_st,CLS_S3_Contig100,53,10,0.3717
        CLS_S3_Contig100_at,CLS_S3_Contig100,55,11,0.4321
        CLS_S3_Contig100_st,CLS_S3_Contig100,57,10,0.3223
        CLS_S3_Contig100_at,CLS_S3_Contig100,59,11,0.4055
        CLS_S3_Contig100_st,CLS_S3_Contig100,61,11,0.4511
        CLS_S3_Contig100_at,CLS_S3_Contig100,63,11,0.474
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,53,12,0.5548
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,57,10,0.4871
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,61,12,0.547
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,62,11,0.5129
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,64,11,0.5789
        DATA

        my %origins;
        my $numColumns;

        open my $inFile, '<', \$data;

        while (<$inFile>) {
            chomp;
            next unless length;

            my @columns = split ',';

            $numColumns ||= @columns; # Assume first row has correct column count
            $origins{$columns[1]}[$columns[2] - 1] = \@columns;
        }

        close $inFile;

        for my $oKey (sort keys %origins) {
            my $origin = $origins{$oKey};

            for my $pip (0 .. $#$origin) {
                my $row = $origin->[$pip];

                if (defined $row) {
                    # pip exists in original file
                    print join (",", @$row, '1'), "\n";
                } else {
                    # pip doesn't exist in original file
                    print ",$oKey,", $pip + 1, ',' x ($numColumns - 2), "0\n";
                }
            }
        }

    Prints (with large middle portion skipped):

        ,CLS_S3_Contig100,1,,,0
        ,CLS_S3_Contig100,2,,,0
        ,CLS_S3_Contig100,3,,,0
        ,CLS_S3_Contig100,4,,,0
        ,CLS_S3_Contig100,5,,,0
        ...
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,57,10,0.4871,1
        ,CLS_S3_Contig10031,58,,,0
        ,CLS_S3_Contig10031,59,,,0
        ,CLS_S3_Contig10031,60,,,0
        CLS_S3_Contig10031_st,CLS_S3_Contig10031,61,12,0.547,1
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,62,11,0.5129,1
        ,CLS_S3_Contig10031,63,,,0
        CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,64,11,0.5789,1

    which demonstrates what I understand you to want.

    The key point is using a HoAoA (hash of arrays of arrays), where the hash is keyed by the origin and the array is indexed by PIP (- 1). Note that Perl generates the missing array elements but sets them to undef, so you can test for defined to see whether you encountered the PIP in the original file.
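
A minimal sketch of that idea in isolation may help: assigning to array slot PIP - 1 makes Perl extend the array, and the skipped slots stay undef. The helper name `build_index`, the origin `ContigX`, and the sample rows are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: index rows of [probeset, origin, PIP] by origin, then by PIP - 1.
sub build_index {
    my @rows = @_;
    my %origins;
    for my $row (@rows) {
        my ( undef, $origin, $pip ) = @$row;
        # Assigning past the end of the array extends it; the gap slots are undef.
        $origins{$origin}[ $pip - 1 ] = $row;
    }
    return \%origins;
}

# Illustrative data only: two probes on one origin, at PIP 2 and PIP 5.
my $index = build_index(
    [ 'probe_a', 'ContigX', 2 ],
    [ 'probe_b', 'ContigX', 5 ],
);

# Walk every slot from PIP 1 to the highest PIP seen and flag whether it was present.
my $slots = $index->{ContigX};
for my $i ( 0 .. $#$slots ) {
    my $flag = defined $slots->[$i] ? 1 : 0;
    print join( "\t", 'ContigX', $i + 1, $flag ), "\n";
}
```

With the two sample rows, only PIP 2 and PIP 5 are flagged 1; PIPs 1, 3 and 4 exist as undef slots and are flagged 0.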


    Perl reduces RSI - it saves typing
      Dear GrandFather, you are a genius and your code is perfect; it fills in the gaps (taking care of the even-odd numbers). However, what can I do if I want all "1"s within a range of plus/minus 8? For example, for PIP=53-337 in the case of Contig100, let's say 45 to 345 all take 1 and everything before 45 takes 0. Thanks again, Pedro
          . . . .
                               CLS_S3_Contig100  40              0
                               CLS_S3_Contig100  41              0
                               CLS_S3_Contig100  42              0
                               CLS_S3_Contig100  43              0
                               CLS_S3_Contig100  44              0
                               CLS_S3_Contig100  45              0
                               CLS_S3_Contig100  46              0
                               CLS_S3_Contig100  47              0
                               CLS_S3_Contig100  48              0
                               CLS_S3_Contig100  49              0
                               CLS_S3_Contig100  50              0
                               CLS_S3_Contig100  51              0
                               CLS_S3_Contig100  52              0
          CLS_S3_Contig100_st  CLS_S3_Contig100  53  10  0.3717  1
                               CLS_S3_Contig100  54              0
          CLS_S3_Contig100_at  CLS_S3_Contig100  55  11  0.4321  1
                               CLS_S3_Contig100  56              0
          CLS_S3_Contig100_st  CLS_S3_Contig100  57  10  0.3223  1
                               CLS_S3_Contig100  58              0
          CLS_S3_Contig100_at  CLS_S3_Contig100  59  11  0.4055  1
                               CLS_S3_Contig100  60              0
          CLS_S3_Contig100_st  CLS_S3_Contig100  61  11  0.4511  1
                               CLS_S3_Contig100  62              0
          CLS_S3_Contig100_at  CLS_S3_Contig100  63  11  0.474   1
                               CLS_S3_Contig100  64              0
          . . . .
          data flow...
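
One possible reading of the plus/minus 8 request is "flag every position that lies within 8 of any PIP seen in the file". The sketch below assumes that reading; it is not something a reply in the thread confirms. The helper name `flags_with_window` is hypothetical, and the PIP list is the Contig100 sample from the original post rather than the full 53-337 range.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: return a list of 0/1 flags, one per position from 1 to
# (highest PIP + window). A position is 1 if it is within $window of any PIP.
sub flags_with_window {
    my ( $window, @pips ) = @_;
    return unless @pips;    # nothing to flag for an empty origin

    my $last  = max(@pips) + $window;
    my @flags = (0) x $last;    # index 0 corresponds to PIP 1

    for my $pip (@pips) {
        my $lo = max( 1, $pip - $window );    # do not run off the front
        my $hi = $pip + $window;
        @flags[ $lo - 1 .. $hi - 1 ] = (1) x ( $hi - $lo + 1 );
    }
    return @flags;
}

# Sample PIPs for CLS_S3_Contig100 taken from the original post.
my @pips  = ( 53, 55, 57, 59, 61, 63 );
my @flags = flags_with_window( 8, @pips );

for my $i ( 0 .. $#flags ) {
    print join( "\t", 'CLS_S3_Contig100', $i + 1, $flags[$i] ), "\n";
}
```

For this sample, positions 1 to 44 come out as 0 and positions 45 to 71 as 1. If the intent is instead a single block from (lowest PIP - 8) to (highest PIP + 8), the result is the same whenever no two neighbouring PIPs are more than 17 apart.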
Re^3: Complex Data Structure
by jwkrahn (Abbot) on Sep 15, 2008 at 01:57 UTC

    Your sample data says you have five fields but your code says you have six fields?   So which is it?

        # then go thorough $hash_F2_2 and that has the key (origin)
        # and max as value and make a multivalue hash as follows:
        for my $k (sort keys %hash_F2_2){
            my $v = $hash_F2_2{$k};
            #print RESULTS "$k\t$v\n";
            for (my $i=1; $i <= $v; $i++){
                push (@{$hash_F2_3{$k}}, $i);
                # this hash (%hash_F2_3) is the hash of origins and PIPs from 1 to max
            }
        }

    You don't need the inner foreach loop, you can just use the range operator.   And if you remove the loop then you don't need to use push either.   And you don't really need the %hash_F2_2 hash either:

        # then go thorough $hash_F2_1 and that has the key (origin)
        # and make a multivalue hash as follows:
        my %hash_F2_3;
        for my $k ( sort keys %hash_F2_1 ) {
            #print RESULTS "$k\t$hash_F2_1{ $k }\n";
            @{ $hash_F2_3{ $k } } = 1 .. max @{ $hash_F2_1{ $k } };
            # this hash (%hash_F2_3) is the hash of origins and PIPs from 1 to max
        }
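
For anyone wanting to run the range-operator version on its own, here is a self-contained sketch. It assumes `max` comes from List::Util (the thread never shows the import), and `%seen` and `%full` are illustrative stand-ins for `%hash_F2_1` and `%hash_F2_3`, filled with the sample PIPs from the original post.

```perl
use strict;
use warnings;
use List::Util qw(max);    # assumed source of max(); not shown in the thread

# Stand-in for %hash_F2_1: origin => list of PIPs seen in the file.
my %seen = (
    CLS_S3_Contig100   => [ 53, 55, 57, 59, 61, 63 ],
    CLS_S3_Contig10031 => [ 53, 57, 61, 62, 64 ],
);

# Stand-in for %hash_F2_3: origin => every position from 1 to the highest PIP.
my %full;
for my $origin ( sort keys %seen ) {
    # The range operator builds the whole list at once; no inner loop or push needed.
    $full{$origin} = [ 1 .. max( @{ $seen{$origin} } ) ];
}

for my $origin ( sort keys %full ) {
    printf "%s\t%d positions\n", $origin, scalar @{ $full{$origin} };
}
```

This builds 63 positions for CLS_S3_Contig100 and 64 for CLS_S3_Contig10031 without an intermediate hash of maxima.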
      Thanks jwkrahn, you are right: the code has six columns while the file has five. I should have removed the $probeseq variable.