mSe has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a little program that will ultimately weed out blast hits that are too similar. The idea is to have it read in a tab-delimited line of data, then split it up into an array and do the same for the next line of data. If the first columns are the same for both lines (arrays), push the array to another array, looping through until we have an array of arrays that have the same first column value. Then I'd like to compare the values in the 2nd column of the arrays in that array. If they are equal, we don't need to do anything, but if they are different, then the values in the second-to-last column must have a difference of 10. Here is the code so far (doesn't work yet):

    use strict;
    use warnings;

    open(input0, "<e_d.txt");
    open(output0, ">e_h.txt");
    my $colNum=0;
    my $limit=10;
    my @arrayEquals;
    my $line = <input0>;
    WHILE: while($line ne undef) {
        s/\r?\n//;
        my @array = split /\t/, $line;
        my $followingLine = <input0>;
        my @followingLineArray=split /\t/, $followingLine;
        if( $array[$colNum] eq $followingLineArray[$colNum]){
            print "match\n";
            push (my @arrayEquals, @array);
        }
        else{
            push (@arrayEquals, @array);
            for my $i(0 .. $#arrayEquals){
                my $colNum=1
                if ($arrayEquals[$i][$colNum] eq $arrayEquals[$i+1][$colNum]){
                    next;
                }else{
                    my $colNo=-2;
                    if (($arrayEquals[$i][$colNo] - $arrayEquals[$i+1][$colNo]) < $limit) {
                        #not enough difference so won't keep any of the lines in @arrayEquals
                        $line = $followingLine;
                        next WHILE;
                    }
                }
            }
            print output0 @arrayEquals; #has difference, so keep the values
            $line = $followingLine;
        }
    }

So for example, say I get to this part of my data (some column removed for simplicity):

    KN-1791-LAST_rep_c7834	IMGA|Medtr4g125100.1	2e-139	497
    KN-1791-LAST_rep_c7834	IMGA|Medtr4g125100.1	4e-46	187
    KN-1791-LAST_rep_c7834	IMGA|Medtr4g125100.2	4e-46	187

I'd be trying to compare IMGA|Medtr4g125100.1 -> 2e-139 with IMGA|Medtr4g125100.2 -> 4e-46 (false positive), and then IMGA|Medtr4g125100.1 -> 4e-46 with IMGA|Medtr4g125100.2 -> 4e-46 (not a difference of 10, so throw them all out).

So far, accessing the 2D arrays correctly is giving me trouble. My bigger question, and the reason for all the explanation, is that I feel like I'm not using everything Perl has to offer, because I just don't know enough. So I was also wondering if there is a better way to do this. Thanks for any input.

Replies are listed 'Best First'.
Re: Accessing 2D array values and comparing
by toolic (Bishop) on Jul 04, 2012 at 21:57 UTC
    I'm not sure what you're trying to do, but I think you should change:
    push (my @arrayEquals, @array);

    to:

    push (@arrayEquals, @array);
    You already declared @arrayEquals at the top of your code.
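
To illustrate the point above, here is a minimal sketch (the names @kept, $count_shadowed and $count_shared are made up for the example). A "my" inside the loop body declares a brand-new, empty array on every pass, so the outer array is never filled:

```perl
use strict;
use warnings;

my @kept;                      # outer array, declared once

for my $value (1 .. 3) {
    push my @kept, $value;     # "my" creates a new, empty @kept on every pass
}
my $count_shadowed = @kept;    # the outer array was never touched

for my $value (1 .. 3) {
    push @kept, $value;        # no "my": appends to the outer array
}
my $count_shared = @kept;      # the outer array now holds all three values

print "shadowed: $count_shadowed, shared: $count_shared\n";
```

The inner declaration does not trigger a warning because it lives in a different scope, which is why this kind of mistake is easy to miss.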
Re: Accessing 2D array values and comparing
by aaron_baugher (Curate) on Jul 05, 2012 at 01:14 UTC

    I don't completely understand your requirements either, but I'll take a stab. You'll probably end up with multiple arrays of arrays, right -- one for each unique value in the first column? In other words, for the value KN-1791-LAST_rep_c7834 you could have an array of arrays, but then there will be another for the next different value found in column 1, and so on. In that case, I suspect you really want a hash of arrays of arrays (HoAoA), keyed on that column 1 value. Or perhaps, once you've built an array of arrays for a particular column 1 value, you can go ahead and print it out or do whatever you need to do with it, and be done with it? I can't really tell.

    It's also not clear what you mean by "a difference of 10", since you didn't give an example like that, and I don't know what you mean by a "false positive." Maybe you could give a longer sample of your input data (10-12 lines) and what you would expect the output to look like, and get some advice on the best way to get there.

    One likely problem, code-wise: this line (below) doesn't make an array of arrays, as you may be expecting. It pushes the values in the second array onto the end of the first array, leaving it one-dimensional. To get an array of arrays, you would push a reference to your sub-array onto the first one, like in my second line:

    push (@arrayEquals, @array); # just makes the array longer
    push @arrayEquals, [@array]; # makes an array of arrays
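
The difference between the two pushes can be seen in a small sketch. The rows below use abbreviated, made-up field values, and the names @flat and @nested are only for illustration:

```perl
use strict;
use warnings;

my @row1 = ('KN-1791', 'IMGA.1', '2e-139', 497);   # abbreviated, made-up fields
my @row2 = ('KN-1791', 'IMGA.2', '4e-46',  187);

my @flat;
push @flat, @row1;
push @flat, @row2;       # the rows are flattened into one list of 8 scalars

my @nested;
push @nested, [@row1];   # [ ... ] copies the row into a new anonymous array
push @nested, [@row2];   # @nested holds 2 elements, each an array reference

print scalar(@flat), " flat elements\n";
print scalar(@nested), " rows; row 1, column 1 is $nested[1][1]\n";
```

Only the nested form supports two-level subscripts such as $nested[1][1]; with the flat array, the same lookup under "use strict" would fail because the element is a plain string, not a reference.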

    Aaron B.

Re: Accessing 2D array values and comparing
by Kenosis (Priest) on Jul 05, 2012 at 06:57 UTC

    First, you use a bare s/\r?\n//;, which operates on $_ rather than on $line; perhaps you meant $line =~ s/\r?\n//;
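
A short sketch of that difference, with made-up sample strings: without the binding operator =~, the substitution acts on $_ and leaves $line alone.

```perl
use strict;
use warnings;

my $line = "a\tb\r\n";     # made-up line with a CRLF ending
local $_ = "unrelated\n";  # whatever happens to be in $_

s/\r?\n//;                 # no binding operator: edits $_, not $line
my $still_has_newline = ($line =~ /\n\z/) ? 1 : 0;

$line =~ s/\r?\n\z//;      # bound to $line: strips an LF or CRLF ending

print "line untouched by bare s///: $still_has_newline\n";
print "fields after cleanup: ", join('|', split /\t/, $line), "\n";
```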

    You've been given excellent feedback. I agree with aaron_baugher that a hash of arrays of arrays (HoAoA) is likely a good fit for your program. With that in mind, consider the following:

    use Modern::Perl;
    use Data::Dumper;

    my %HoAoArrays;

    while (<DATA>) {
        chomp;
        my ( $col0, @cols1_3 ) = split /\t/;
        push @{ $HoAoArrays{$col0} }, \@cols1_3;
    }

    for my $key ( keys %HoAoArrays ) {
        my $numElements= @{ $HoAoArrays{$key} };
        for ( my $i = 0 ; $i < $numElements; $i++ ) {
            say '${ $HoAoArrays{' . $key . '} }[' . $i . ']->[1] = '
              . ${ $HoAoArrays{$key} }[$i]->[1];
        }
        say '';
    }

    say '', Dumper \%HoAoArrays;

    __DATA__
    KN-1791-LAST_rep_c7834	IMGA|Medtr4g125100.1	2e-139	497
    KN-1791-LAST_rep_c7834	IMGA|Medtr4g125100.1	4e-46	187
    KN-1791-LAST_rep_c7834	IMGA|Medtr4g125100.2	4e-46	187
    KN-1792-LAST_rep_c7834	IMGA|Medtr4g125100.1	2e-150	497
    KN-1792-LAST_rep_c7834	IMGA|Medtr4g125100.1	4e-37	187
    KN-1792-LAST_rep_c7834	IMGA|Medtr4g125100.2	4e-37	187
    KN-1792-LAST_rep_c7834	IMGA|Medtr4g125100.3	4e-35	188

    Output:

    ${ $HoAoArrays{KN-1792-LAST_rep_c7834} }[0]->[1] = 2e-150
    ${ $HoAoArrays{KN-1792-LAST_rep_c7834} }[1]->[1] = 4e-37
    ${ $HoAoArrays{KN-1792-LAST_rep_c7834} }[2]->[1] = 4e-37
    ${ $HoAoArrays{KN-1792-LAST_rep_c7834} }[3]->[1] = 4e-35

    ${ $HoAoArrays{KN-1791-LAST_rep_c7834} }[0]->[1] = 2e-139
    ${ $HoAoArrays{KN-1791-LAST_rep_c7834} }[1]->[1] = 4e-46
    ${ $HoAoArrays{KN-1791-LAST_rep_c7834} }[2]->[1] = 4e-46

    $VAR1 = {
              'KN-1792-LAST_rep_c7834' => [
                                            [
                                              'IMGA|Medtr4g125100.1',
                                              '2e-150',
                                              '497'
                                            ],
                                            [
                                              'IMGA|Medtr4g125100.1',
                                              '4e-37',
                                              '187'
                                            ],
                                            [
                                              'IMGA|Medtr4g125100.2',
                                              '4e-37',
                                              '187'
                                            ],
                                            [
                                              'IMGA|Medtr4g125100.3',
                                              '4e-35',
                                              '188'
                                            ]
                                          ],
              'KN-1791-LAST_rep_c7834' => [
                                            [
                                              'IMGA|Medtr4g125100.1',
                                              '2e-139',
                                              '497'
                                            ],
                                            [
                                              'IMGA|Medtr4g125100.1',
                                              '4e-46',
                                              '187'
                                            ],
                                            [
                                              'IMGA|Medtr4g125100.2',
                                              '4e-46',
                                              '187'
                                            ]
                                          ]
            };

    The while loop splits each tab-delimited line into a scalar ($col0) and an array (@cols1_3). Wrapping the hash element in @{ } (i.e., @{ $HoAoArrays{$col0} }) tells perl to treat its value as an array, which is created automatically the first time a key is seen, and a reference to @cols1_3 is pushed onto it.

    The subsequent for loop shows how to dereference the array references. Finally, Dumper is used to visually represent the created data structure: { } = hash and [ ] = array. Dumper shows a hash with two keys. Each key has a corresponding array as its value, and each array has arrays as its elements (HoAoA).
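
As a complementary sketch, the same kind of structure can be walked with arrow syntax, which some readers find easier to follow than ${ ... }[...]. The hash %hits, its keys, and its values below are abbreviated and made up for the example:

```perl
use strict;
use warnings;

# Abbreviated, made-up keys and values standing in for the BLAST fields.
my %hits = (
    q1 => [ [ 'hitA', '2e-139', 497 ], [ 'hitB', '4e-46', 187 ] ],
    q2 => [ [ 'hitA', '2e-150', 497 ] ],
);

my @report;
for my $query (sort keys %hits) {
    my $rows = $hits{$query};                  # reference to the array of rows
    for my $i (0 .. $#{$rows}) {
        my ($hit, $evalue, $score) = @{ $rows->[$i] };   # unpack one row
        push @report, sprintf '%s[%d] %s %s %s', $query, $i, $hit, $evalue, $score;
    }
}
print "$_\n" for @report;

# Between adjacent subscripts the arrow is optional, so both forms
# below address the same element.
my $same = $hits{q1}[1][0] eq ${ $hits{q1} }[1]->[0] ? 'yes' : 'no';
print "same element: $same\n";
```

Sorting the keys is a choice made here only to get a stable order; a plain keys call returns them in no particular order.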

    Hope this helps!

Re: Accessing 2D array values and comparing
by ig (Vicar) on Jul 05, 2012 at 07:18 UTC

    I too am uncertain about your objective, but from what I see I suspect your solution is more complicated than it needs to be. The following is much simpler, and might be what you want. If not, it might be a basis for modification to get what you want. To that end, I suggest starting with something simple that works, and then making one change at a time to make it closer to what you want, making sure your program runs as expected at each step.

    Anyway, here is something you might consider:

    use strict;
    use warnings;
    use Data::Dumper;

    my $minimum_difference = 10;
    my $input_filename = 'e_d.txt';

    open(my $input_fh, '<', $input_filename)
        or die "$input_filename: $!";

    my $previous_fields;

    while (my $line = <$input_fh>) {
        chomp($line);
        my $fields = [ split(/\t/, $line) ];
        if(
            defined($previous_fields) and
            $previous_fields->[0] eq $fields->[0] and
            $previous_fields->[1] ne $fields->[1] and
            $previous_fields->[-2] - $fields->[-2] < $minimum_difference
        ) {
            print "Failed test:\n" . Dumper([ $previous_fields, $fields ]) . "\n\n";
        }
        $previous_fields = $fields;
    }

    I have made many assumptions. Most significant is the assumption that you are only interested in differences between consecutive lines in the input file.
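
If that assumption does not hold, one possible variation is to collect all rows sharing a first column and compare every pair within the group. The sketch below is only a guess at the requirement: the helper name group_too_similar is hypothetical, the sample rows are made up, and it assumes the second-to-last column is a plain number whose absolute difference is what matters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if any two rows with different column-1
# values have second-to-last columns closer together than $min_diff.
sub group_too_similar {
    my ($rows, $min_diff) = @_;
    for my $i (0 .. $#{$rows} - 1) {
        for my $j ($i + 1 .. $#{$rows}) {
            next if $rows->[$i][1] eq $rows->[$j][1];    # same hit: nothing to check
            return 1 if abs($rows->[$i][-2] - $rows->[$j][-2]) < $min_diff;
        }
    }
    return 0;
}

# Made-up groups: columns are query, hit, value, score.
my @distinct = ( [ 'q1', 'hitA', 120, 497 ], [ 'q1', 'hitB', 105, 187 ] );
my @close    = ( [ 'q1', 'hitA', 120, 497 ], [ 'q1', 'hitB', 115, 187 ] );

my $distinct_flag = group_too_similar(\@distinct, 10);   # values differ by 15
my $close_flag    = group_too_similar(\@close, 10);      # values differ by 5

print "distinct group too similar? $distinct_flag\n";
print "close group too similar? $close_flag\n";
```

Using abs() makes the test independent of row order, which the plain subtraction in a consecutive-line comparison is not.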

    I have used Data::Dumper to dump both arrays of fields when the test fails. This module is very helpful when you are developing code that deals with data structures. You also might want to read perldsc.

Re: Accessing 2D array values and comparing
by mSe (Initiate) on Jul 05, 2012 at 12:03 UTC

    Thanks so much for the great replies, going to give things an overhaul for sure, and I'll know how to clarify things for future questions! Thanks again

Re: Accessing 2D array values and comparing
by Marshall (Canon) on Jul 07, 2012 at 03:51 UTC
    I am not sure what you are doing, but BLAST ("Basic Local Alignment Search Tool") is a bioinformatics "buzzword", and there are a lot of tools for dealing with it.

    I don't use BioPerl and I really don't know what the possibilities are, but I would start by Googling "Perl BLAST".

    Your remark, "I feel like I'm not using everything Perl has to offer", is probably right! Bioinformatics and Perl fit very well together. Perl can do a lot more than just solve the problem at hand. I think that you will find more powerful tools than we can present here for the "problem of the day".