Grig has asked for the wisdom of the Perl Monks concerning the following question:

Good day, dear Monks!

Unfortunately I use Perl rather rarely. At the moment I'm dealing with genomic coordinates. I have two large arrays of data. The first contains coordinates and the second contains distances. Among the coordinates there are unique values and repeats, although the distance value for every coordinate is unique. I would like to compose a final set of unique coordinates and their corresponding distances. The main task is excluding the repeated coordinates, under the following condition: for each repeated coordinate, choose the coordinate-distance pair with the smallest distance value.

Here is an example of the data:

coordinate distance
567        344
1345       567
2346       78
3456       67
3456       789
4678       45
5349       6
6700       124
6700       50
8964       560

I have written a very simple script:

$coord = 'coordinate.txt';
open(COORD, $coord) || die("Couldn't read file $coord\n");
$dist = 'dist.txt';
open(DIST, $dist) || die("Couldn't read file $dist\n");
@d = <DIST>;
@c = <COORD>;
for ( $i = 0; $i <= 13; $i++ ) {
    if ( $c[$i] != $c[$i+1] ) {
        printf "%d %d\n", $c[$i], $d[$i];
    }
}

It selects the last line of a repeat without comparing the distance values. How would you advise changing it to get only unique coordinate-distance pairs, choosing the pair with the smallest distance value whenever there are repeated coordinates?

So for the data given earlier, I am trying to get output like this:
coordinate distance
567        344
1345       567
2346       78
3456       67
4678       45
5349       6
6700       50
8964       560

Thank you in advance!

Replies are listed 'Best First'.
Re: Repeats exclusion
by suhailck (Friar) on Sep 12, 2010 at 06:12 UTC
    Use a hash of arrays, as below:

    my %hash;
    while (<DATA>) {
        my ($cord, $dist) = split;
        push @{$hash{$cord}}, $dist;
    }
    print join "\n",
        map { $_ . "\t\t" . (sort { $a <=> $b } @{$hash{$_}})[0] }
        sort { $a <=> $b } keys %hash;
    __DATA__
    567 344
    1345 567
    2346 78
    3456 67
    3456 789
    4678 45
    5349 6
    6700 124
    6700 50
    8964 560

    Output:

    567     344
    1345    567
    2346    78
    3456    67
    4678    45
    5349    6
    6700    50
    8964    560
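
    As a side note, finding the smallest element by taking the head of a numeric sort can also be written with min from the core List::Util module; a minimal sketch of the print step rewritten that way (reusing the %hash built above):

    use List::Util qw(min);
    # min() does the numeric comparison, so no sort is needed per coordinate
    print join "\n",
        map { $_ . "\t\t" . min(@{$hash{$_}}) }
        sort { $a <=> $b } keys %hash;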

      Thank you very much!

      And how could this script be changed if it is necessary to include additional variables in the hash?

      For example, chromosome number, number of certain elements, and so on? Let them be the variables $chr, $exons and $palindromes. At the same time, the lines that contain a repeated coord value should be excluded according to the distance value (the lines with the smallest distance value should remain), as your script already does.

      coord dist chr exons palindromes
      567   344  5   7     8
      1345  567  5   8     123
      2346  78   12  1     567
      3456  67   10  1     5
      3456  789  10  3     6
      4678  45   6   2     0
      5349  6    8   2     14
      6700  124  13  8     56
      6700  50   13  1     4
      8964  560  2   18    8
      So the output will be something like this:
      coord dist chr exons palindromes
      567   344  5   7     8
      1345  567  5   8     123
      2346  78   12  1     567
      3456  67   10  1     5
      4678  45   6   2     0
      5349  6    8   2     14
      6700  50   13  1     4
      8964  560  2   18    8

      Thank you once more!

        You know, first asking "how do I do X", and then after getting the answer coming back with "yeah, but I didn't really want X, I want Y, Z and W" doesn't win you many friends.

        Next time, ask for what you want in the first place. Then you may get useful answers. Now you've just wasted someone's time, and you still don't have a solution.

        The code, modified for your new requirement:

        use strict;
        my %hash;
        while (<DATA>) {
            my ($cord, @others) = split;
            push @{$hash{$cord}}, \@others;
        }
        print map {
            my $__ = $_;
            map { $__, " @{$hash{$__}[$_->[0]]}", "\n" }
                (sort { $a->[1] <=> $b->[1] }
                 map { [$_, $hash{$__}[$_][0]] }
                 0 .. $#{$hash{$__}})[0]
        } sort { $a <=> $b } keys %hash;
        __DATA__
        567 344 5 7 8
        1345 567 5 8 123
        2346 78 12 1 567
        3456 67 10 1 5
        3456 789 10 3 6
        4678 45 6 2 0
        5349 6 8 2 14
        6700 124 13 8 56
        6700 50 13 1 4
        8964 560 2 18 8
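
        That nested map/sort is dense; a plain loop doing the same thing (an untested sketch, assuming the same %hash layout as above) may be easier to follow:

        for my $cord (sort { $a <=> $b } keys %hash) {
            # pick the row (dist chr exons palindromes) with the smallest distance
            my ($best) = sort { $a->[0] <=> $b->[0] } @{$hash{$cord}};
            print "$cord @$best\n";
        }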


Re: Repeats exclusion
by repellent (Priest) on Sep 12, 2010 at 07:21 UTC
    use warnings;
    use strict;

    open(my $COORD, "<", "coordinate.txt")
        or die("Couldn't read coordinate file - $!");
    open(my $DIST, "<", "dist.txt")
        or die("Couldn't read distance file - $!");

    my %coord;
    until (eof $COORD) {
        my $c = (split " ", scalar <$COORD>)[0];
        my $d = (split " ", scalar <$DIST>)[0];

        my $p = $coord{$c};
        $coord{$c} = $d if !defined($p) || $d < $p;
    }
    close($COORD);
    close($DIST);

    print "$_\t$coord{$_}\n" for sort { $a <=> $b } keys %coord;

    Update: Better handling of data using split.

      Thank you for your help!

      Sorry, but your script excludes only the second repeat, so the actual output looks like this:

      567 344
      1345 567
      2346 78
      3456 67
      3456 789
      4678 45
      5349 6
      6700 50
      8964 560
      Thank you once more.
        You may have trailing spaces in the coordinate file. I've updated the code to be more robust in this regard.
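
        (For reference, split with the literal string " " is special-cased in Perl: it discards leading whitespace and splits on any whitespace run, so stray spaces and the newline fall away. A quick illustration, with a made-up input line:)

        my $line  = "  6700  \n";
        my ($c)   = (split " ", $line);   # "6700" - whitespace handled
        my ($raw) = (split / /, $line);   # ""     - leading space yields an empty field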
Re: Repeats exclusion
by jwkrahn (Abbot) on Sep 12, 2010 at 08:16 UTC
    my $coord = 'coordinate.txt';
    my $dist  = 'dist.txt';

    open COORD, '<', $coord or die "Couldn't read file $coord: $!";
    open DIST,  '<', $dist  or die "Couldn't read file $dist: $!";

    my %data;
    @data{ <COORD> } = <DIST>;

    for my $key ( keys %data ) {
        print "$key $data{$key}\n";
    }

      If later distances are greater than earlier ones, they will persist, and so the OP's goal of finding the lowest distance associated with each coordinate will not be achieved.
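
      One small change would fix that: compare before assigning instead of relying on the hash-slice overwrite. A rough sketch along the lines of the parent code (same filehandles assumed, untested):

      my %data;
      while ( my $c = <COORD> ) {
          my $d = <DIST>;
          chomp( $c, $d );
          # keep the smaller distance when a coordinate repeats
          $data{$c} = $d if !exists $data{$c} || $d < $data{$c};
      }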

Re: Repeats exclusion
by AnomalousMonk (Archbishop) on Sep 14, 2010 at 00:11 UTC

    This is a non-Perl solution that takes advantage of the (possibly adventitious) structure of the example input data file in which the coordinate field of each record is in numerically ascending order; I assume you want to preserve this order. If, in addition, the distance fields were ordered in the same way, the desired output could be generated by excluding all records that repeat the coordinate field.

    The following solution uses the GNU utilities sort and uniq. It may be handier or quicker than a Perl solution, but is also less flexible and extensible. (Note: The following is tested with a GNU-ish port of these utilities to Windoze: best I can do right now.)

    >sort -g -k1 -k2 859805.dat | uniq -w12 > 859805.least
    >cat 859805.least
    567         344   5    7    8
    1345        567   5    8    123
    2346        78    12   1    567
    3456        67    10   1    5
    4678        45    6    2    0
    5349        6     8    2    14
    6700        50    13   1    4
    8964        560   2    18   8

    Also note: the uniq -w switch, which seems to be supported by GNU/Linux, may not be supported in other implementations.
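
    Where uniq -w is unavailable, the same first-record-wins pass can be done with a Perl one-liner after the sort; a rough equivalent (untested):

    >sort -g -k1 -k2 859805.dat | perl -ane 'print unless $seen{$F[0]}++'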

Re: Repeats exclusion
by jfraire (Beadle) on Sep 14, 2010 at 04:47 UTC

    This is my go at your questions. For the first one:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my %hash;
    while (<DATA>) {
        my ($coord, $dist) = split;
        $hash{$coord} = $dist
            unless defined $hash{$coord} && $dist > $hash{$coord};
    }

    print "$_\t$hash{$_}\n" foreach (sort { $a <=> $b } keys %hash);

    __DATA__
    567 344
    1345 567
    2346 78
    3456 67
    3456 789
    4678 45
    5349 6
    6700 124
    6700 50
    8964 560

    What's important to notice here is that the distance for the coordinate is saved every time except when the coordinate already has a distance and the new distance is larger than the one already saved.

    For the second question, where you add more columns... well, you need either a hash or an array to accommodate those. But the idea is the same:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my %hash;
    while (<DATA>) {
        my ($coord, @cols) = split;
        $hash{$coord} = \@cols
            unless (defined $hash{$coord} && $cols[0] > $hash{$coord}[0]);
    }

    print join "\t", qw(coord dist chr exons palindromes), "\n";
    print join "\t", $_, @{$hash{$_}}, "\n"
        foreach (sort { $a <=> $b } keys %hash);

    __DATA__
    567 344 5 7 8
    1345 567 5 8 123
    2346 78 12 1 567
    3456 67 10 1 5
    3456 789 10 3 6
    4678 45 6 2 0
    5349 6 8 2 14
    6700 124 13 8 56
    6700 50 13 1 4
    8964 560 2 18 8

    Here I chose a hash of arrays.

Re: Repeats exclusion
by TomDLux (Vicar) on Sep 15, 2010 at 18:56 UTC

    My approach is to filter records while reading data.

    use warnings;
    use strict;

    my $format = "%-11s %-8s\n";

    open my $data, '<', 'data.txt'
        or die "Could not open 'data.txt': $!\n";

    my %dataset;

    RECORD:
    while ( my $line = <$data> ) {
        chomp $line;
        my ( $coord, $dist ) = split ',', $line;

        next RECORD if exists $dataset{$coord}
                   and $dataset{$coord} < $dist;

        $dataset{$coord} = $dist;
    }

    printf $format, 'coordinate', 'distance';
    printf $format, $_, $dataset{$_} for sort { $a <=> $b } keys %dataset;
    • I turn on warnings and strict as soon as my program grows beyond a few lines, because I don't need to spend 2 hours investigating something strange which turns out to be a typo.
    • I'm not sure how specific your format needs to be, so I assumed hard-coded column widths and used printf with a format string, rather than print or say.
    • Read in lines from a comma-separated file, discard the newline and split on comma into temporary variables.
    • If the distance is not the shortest on record for this coordinate, jump to the next input record. Although you can use next or last or redo without a label, I like to always specify what I'm next-ing or last-ing or redo-ing, especially if the loop is long or if it's a nested loop. Save the data if we haven't bailed.
    • Extract the keys from the hash, sort them numerically, and print out the keys and values.
    • I never actually close input files ... this case is too short and fast to care, and in more complicated instances I would have a readfile() routine, which would automatically close the file when it went out of scope (a minimal sketch follows this list).
    • You can get more information on next, last, sort and other stuff by typing perldoc -f sort (or whatever the command name may be) into an xterm window, or you can use the perldoc web site.
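
    A minimal sketch of such a readfile() routine (hypothetical name; the lexical handle is closed automatically when it goes out of scope):

    sub readfile {
        my ($name) = @_;
        open my $fh, '<', $name or die "Could not open '$name': $!\n";
        my @lines = <$fh>;
        chomp @lines;
        return @lines;    # $fh goes out of scope here and the file is closed
    }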

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: Repeats exclusion
by TomDLux (Vicar) on Sep 15, 2010 at 19:32 UTC

    To handle multiple fields, I have the hash store a hash of values, rather than a single value. That way I can access the fields by name. The top level is still controlled by coordinate name. The changes from the previous version are minor:

    use warnings;
    use strict;
    use feature 'say';    # needed to enable say() below

    open my $data, '<', 'alldata.txt'
        or die "Could not open 'alldata.txt': $!\n";

    my %dataset;

    RECORD:
    while ( my $line = <$data> ) {
        chomp $line;
        next RECORD unless $line =~ m{\d};

        my ( $coord, $dist, $chr, $exons, $pals ) = split /\s+/, $line;

        next RECORD if exists $dataset{$coord}
                   and $dataset{$coord}{dist} < $dist;

        $dataset{$coord} = {
            dist        => $dist,
            chr         => $chr,
            exons       => $exons,
            palindromes => $pals,
        };
    }

    say "coord\tdist\tchr\texons\tpalindromes";
    say "$_\t$dataset{$_}{dist}\t$dataset{$_}{chr}\t$dataset{$_}{exons}\t"
      . "$dataset{$_}{palindromes}"
        for sort { $a <=> $b } keys %dataset;
    • I just cut-and-paste your dataset, rather than converting to comma-separated, so I split on the regexp /\s+/ ... one or more spaces.
    • I had blank lines at the end of the file. Rather than modify the file, I ignore lines that don't contain numbers. It's frequently useful to also skip lines that contain a '#', that way you can have headers and comments in the data file.
    • I create an anonymous hash of all the data other than the coordinate, and then use the coordinate to identify the top-level entries. Note also that you can add more comparison statements; for example, if you have two records tied for distance, you could test for an odd number of exons and do a next RECORD on that (a sketch follows this list).
    • I didn't bother with a format string, too much counting. But I did split the string across two lines, and use a '.' to join them ... I don't like over-long code lines.
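
    For instance, the tie-break mentioned above could look like this inside the RECORD loop (a hedged sketch; the odd-exons rule is only the example from the list):

    if ( exists $dataset{$coord} ) {
        next RECORD if $dataset{$coord}{dist} < $dist;
        # tied on distance: skip the new record when its exon count is odd
        next RECORD if $dataset{$coord}{dist} == $dist and $exons % 2;
    }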

    The entire data structure, according to Data::Dumper:

    $VAR1 = {
        '6700' => {
            'palindromes' => '4',
            'chr' => '13',
            'exons' => '1',
            'dist' => 50
        },
        '4678' => {
            'palindromes' => '0',
            'chr' => '6',
            'exons' => '2',
            'dist' => '45'
        },
        '2346' => {
            'palindromes' => '567',
            'chr' => '12',
            'exons' => '1',
            'dist' => '78'
        },
        '5349' => {
            'palindromes' => '14',
            'chr' => '8',
            'exons' => '2',
            'dist' => '6'
        },
        '3456' => {
            'palindromes' => '5',
            'chr' => '10',
            'exons' => '1',
            'dist' => 67
        },
        '567' => {
            'palindromes' => '8',
            'chr' => '5',
            'exons' => '7',
            'dist' => '344'
        },
        '1345' => {
            'palindromes' => '123',
            'chr' => '5',
            'exons' => '8',
            'dist' => '567'
        },
        '8964' => {
            'palindromes' => '8',
            'chr' => '2',
            'exons' => '18',
            'dist' => '560'
        }
    };

    P.S. The debugger is your friend: perl -d program.pl.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.