Grig has asked for the wisdom of the Perl Monks concerning the following question:

Good day, dear Monks!

Unfortunately I use Perl rather rarely. At the moment I'm dealing with genomic coordinates. I have two large arrays of data. The first contains coordinates and the second contains distances. Among the coordinates there are unique values and repeats, although the distance value for every coordinate is unique. I would like to compose a final set of unique coordinates and their corresponding distances. The main task is excluding the repeated coordinates, under the following condition: for each repeated coordinate, choose the coordinate-distance pair with the smallest distance value.

Here is an example of the data:

coordinate distance
567        344
1345       567
2346       78
3456       67
3456       789
4678       45
5349       6
6700       124
6700       50
8964       560

I have written a very simple script:

$coord = 'coordinate.txt';
open(COORD, $coord) || die("Couldn't read file $coord\n");
$dist = 'dist.txt';
open(DIST, $dist) || die("Couldn't read file $dist\n");
@d = <DIST>;
@c = <COORD>;
for ( $i = 0; $i <= 13; $i++ ) {
    if ( $c[$i] != $c[$i+1] ) {
        printf "%d %d\n", $c[$i], $d[$i];
    }
}

It selects the last line of a repeat without comparing the distance values. How would you advise changing it to get only unique coordinate-distance pairs, choosing the pair with the smallest distance value whenever there are repeated coordinates?

So for the data given earlier, I am trying to get output like this:
coordinate distance
567        344
1345       567
2346       78
3456       67
4678       45
5349       6
6700       50
8964       560

Thank you in advance!

Replies are listed 'Best First'.
Re: Repeats exclusion
by suhailck (Friar) on Sep 12, 2010 at 06:12 UTC
    Use a hash of arrays, as below:

    my %hash;
    while (<DATA>) {
        my ($cord, $dist) = split;
        push @{$hash{$cord}}, $dist;
    }
    print join "\n",
        map { $_ . "\t\t" . (sort { $a <=> $b } @{$hash{$_}})[0] }
        sort { $a <=> $b } keys %hash;
    __DATA__
    567 344
    1345 567
    2346 78
    3456 67
    3456 789
    4678 45
    5349 6
    6700 124
    6700 50
    8964 560

    Output:

    567     344
    1345    567
    2346    78
    3456    67
    4678    45
    5349    6
    6700    50
    8964    560
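
    As a side note, finding the smallest element by taking the head of a numeric sort can also be written with min from the core List::Util module; a minimal sketch of the print step rewritten that way (reusing the %hash built above):

    use List::Util qw(min);
    # min() does the numeric comparison, so no sort is needed per coordinate
    print join "\n",
        map { $_ . "\t\t" . min(@{$hash{$_}}) }
        sort { $a <=> $b } keys %hash;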

      Thank you very much!

      And how could this script be changed if it is necessary to include additional variables in the hash?

      For example, chromosome number, number of certain elements, and so on? Let them be the variables $chr, $exons and $palindromes. At the same time, the lines that contain a repeated coord value should be excluded according to the distance value (the lines with the smallest distance value should remain), as your script already does.

      coord dist chr exons palindromes
      567   344  5   7     8
      1345  567  5   8     123
      2346  78   12  1     567
      3456  67   10  1     5
      3456  789  10  3     6
      4678  45   6   2     0
      5349  6    8   2     14
      6700  124  13  8     56
      6700  50   13  1     4
      8964  560  2   18    8
      So the output will be something like this:
      coord dist chr exons palindromes
      567   344  5   7     8
      1345  567  5   8     123
      2346  78   12  1     567
      3456  67   10  1     5
      4678  45   6   2     0
      5349  6    8   2     14
      6700  50   13  1     4
      8964  560  2   18    8

      Thank you once more!

        You know, first asking "how do I do X", and then after getting the answer coming back with "yeah, but I didn't really want X, I want Y, Z and W" doesn't win you many friends.

        Next time, ask for what you want in the first place. Then you may get useful answers. Now you've just wasted someone's time, and you still don't have a solution.

        The code, modified for your new requirement:

        use strict;
        my %hash;
        while (<DATA>) {
            my ($cord, @others) = split;
            push @{$hash{$cord}}, \@others;
        }
        print map {
            my $__ = $_;
            map { $__, " @{$hash{$__}[$_->[0]]}", "\n" }
                (sort { $a->[1] <=> $b->[1] }
                 map { [$_, $hash{$__}[$_][0]] }
                 0 .. $#{$hash{$__}})[0]
        } sort { $a <=> $b } keys %hash;
        __DATA__
        567 344 5 7 8
        1345 567 5 8 123
        2346 78 12 1 567
        3456 67 10 1 5
        3456 789 10 3 6
        4678 45 6 2 0
        5349 6 8 2 14
        6700 124 13 8 56
        6700 50 13 1 4
        8964 560 2 18 8
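
        That nested map/sort is dense; a plain loop doing the same thing (an untested sketch, assuming the same %hash layout as above) may be easier to follow:

        for my $cord (sort { $a <=> $b } keys %hash) {
            # pick the row (dist chr exons palindromes) with the smallest distance
            my ($best) = sort { $a->[0] <=> $b->[0] } @{$hash{$cord}};
            print "$cord @$best\n";
        }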


Re: Repeats exclusion
by repellent (Priest) on Sep 12, 2010 at 07:21 UTC
    use warnings;
    use strict;

    open(my $COORD, "<", "coordinate.txt")
        or die("Couldn't read coordinate file - $!");
    open(my $DIST, "<", "dist.txt")
        or die("Couldn't read distance file - $!");

    my %coord;
    until (eof $COORD) {
        my $c = (split " ", scalar <$COORD>)[0];
        my $d = (split " ", scalar <$DIST>)[0];

        my $p = $coord{$c};
        $coord{$c} = $d if !defined($p) || $d < $p;
    }
    close($COORD);
    close($DIST);

    print "$_\t$coord{$_}\n" for sort { $a <=> $b } keys %coord;

    Update: Better handling of data using split.

      Thank you for your help!

      Sorry, but your script excludes only the second repeat, so the actual output looks like this:

      567 344
      1345 567
      2346 78
      3456 67
      3456 789
      4678 45
      5349 6
      6700 50
      8964 560
      Thank you once more.
        You may have trailing spaces in the coordinate file. I've updated the code to be more robust in this regard.
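
        (For reference, split with the literal string " " is special-cased in Perl: it discards leading whitespace and splits on any whitespace run, so stray spaces and the newline fall away. A quick illustration, with a made-up input line:)

        my $line  = "  6700  \n";
        my ($c)   = (split " ", $line);   # "6700" - whitespace handled
        my ($raw) = (split / /, $line);   # ""     - leading space yields an empty field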
Re: Repeats exclusion
by jwkrahn (Abbot) on Sep 12, 2010 at 08:16 UTC
    my $coord = 'coordinate.txt';
    my $dist  = 'dist.txt';

    open COORD, '<', $coord or die "Couldn't read file $coord: $!";
    open DIST,  '<', $dist  or die "Couldn't read file $dist: $!";

    my %data;
    @data{ <COORD> } = <DIST>;

    for my $key ( keys %data ) {
        print "$key $data{$key}\n";
    }

      If later distances are greater than earlier ones, they will persist, and so the OP's goal of finding the lowest distance associated with each coordinate will not be achieved.
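
      One small change would fix that: compare before assigning instead of relying on the hash-slice overwrite. A rough sketch along the lines of the parent code (same filehandles assumed, untested):

      my %data;
      while ( my $c = <COORD> ) {
          my $d = <DIST>;
          chomp( $c, $d );
          # keep the smaller distance when a coordinate repeats
          $data{$c} = $d if !exists $data{$c} || $d < $data{$c};
      }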

Re: Repeats exclusion
by AnomalousMonk (Archbishop) on Sep 14, 2010 at 00:11 UTC

    This is a non-Perl solution that takes advantage of the (possibly adventitious) structure of the example input data file in which the coordinate field of each record is in numerically ascending order; I assume you want to preserve this order. If, in addition, the distance fields were ordered in the same way, the desired output could be generated by excluding all records that repeat the coordinate field.

    The following solution uses the GNU utilities sort and uniq. It may be handier or quicker than a Perl solution, but is also less flexible and extensible. (Note: The following is tested with a GNU-ish port of these utilities to Windoze: best I can do right now.)

    >sort -g -k1 -k2 859805.dat | uniq -w12 > 859805.least
    >cat 859805.least
    567         344   5    7    8
    1345        567   5    8    123
    2346        78    12   1    567
    3456        67    10   1    5
    4678        45    6    2    0
    5349        6     8    2    14
    6700        50    13   1    4
    8964        560   2    18   8

    Also note: the uniq -w switch, which seems to be supported by GNU/Linux, may not be supported in other implementations.
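
    Where uniq -w is unavailable, the same first-record-wins pass can be done with a Perl one-liner after the sort; a rough equivalent (untested):

    >sort -g -k1 -k2 859805.dat | perl -ane 'print unless $seen{$F[0]}++'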

Re: Repeats exclusion
by jfraire (Beadle) on Sep 14, 2010 at 04:47 UTC

    This is my go at your questions. For the first one:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my %hash;
    while (<DATA>) {
        my ($coord, $dist) = split;
        $hash{$coord} = $dist
            unless defined $hash{$coord} && $dist > $hash{$coord};
    }

    print "$_\t$hash{$_}\n" foreach (sort { $a <=> $b } keys %hash);

    __DATA__
    567 344
    1345 567
    2346 78
    3456 67
    3456 789
    4678 45
    5349 6
    6700 124
    6700 50
    8964 560

    What's important to notice here is that the distance for the coordinate is saved every time except when the coordinate already has a distance and the new distance is larger than the one already saved.

    For the second question, where you add more columns... well, you need either a hash or an array to accommodate those. But the idea is the same:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my %hash;
    while (<DATA>) {
        my ($coord, @cols) = split;
        $hash{$coord} = \@cols
            unless (defined $hash{$coord} && $cols[0] > $hash{$coord}[0]);
    }

    print join "\t", qw(coord dist chr exons palindromes), "\n";
    print join "\t", $_, @{$hash{$_}}, "\n"
        foreach (sort { $a <=> $b } keys %hash);

    __DATA__
    567 344 5 7 8
    1345 567 5 8 123
    2346 78 12 1 567
    3456 67 10 1 5
    3456 789 10 3 6
    4678 45 6 2 0
    5349 6 8 2 14
    6700 124 13 8 56
    6700 50 13 1 4
    8964 560 2 18 8

    Here I chose a hash of arrays.

Re: Repeats exclusion
by TomDLux (Vicar) on Sep 15, 2010 at 18:56 UTC

    My approach is to filter records while reading data.

    use warnings;
    use strict;

    my $format = "%-11s %-8s\n";

    open my $data, '<', 'data.txt'
        or die "Could not open 'data.txt': $!\n";

    my %dataset;

    RECORD:
    while ( my $line = <$data> ) {
        chomp $line;
        my ( $coord, $dist ) = split ',', $line;

        next RECORD if exists $dataset{$coord}
                   and $dataset{$coord} < $dist;

        $dataset{$coord} = $dist;
    }

    printf $format, 'coordinate', 'distance';
    printf $format, $_, $dataset{$_} for sort { $a <=> $b } keys %dataset;
    • I turn on warnings and strict as soon as my program grows beyond a few lines, because I don't need to spend 2 hours investigating something strange which turns out to be a typo.
    • I'm not sure how specific your format needs to be, so I assumed hard-coded column widths and used printf with a format string, rather than print or say.
    • Read in lines from a comma-separated file, discard the newline and split on comma into temporary variables.
    • If the distance is not the shortest on record for this coordinate, jump to the next input record. Although you can use next or last or redo without a label, I like to always specify what I'm next-ing or last-ing or redo-ing, especially if the loop is long or if it's a nested loop. Save the data if we haven't bailed.
    • Extract the keys from the hash, sort them numerically, and print out the keys and values.
    • I never actually close input files ... this case is too short and fast to care, and in more complicated instances I would have a readfile() routine, which would automatically close the file when it went out of scope (a minimal sketch follows this list).
    • You can get more information on next, last, sort and other stuff by typing perldoc -f sort (or whatever the command name may be) into an xterm window, or you can use the perldoc web site.
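
    A minimal sketch of such a readfile() routine (hypothetical name; the lexical handle is closed automatically when it goes out of scope):

    sub readfile {
        my ($name) = @_;
        open my $fh, '<', $name or die "Could not open '$name': $!\n";
        my @lines = <$fh>;
        chomp @lines;
        return @lines;    # $fh goes out of scope here and the file is closed
    }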

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: Repeats exclusion
by TomDLux (Vicar) on Sep 15, 2010 at 19:32 UTC

    To handle multiple fields, I have the hash store a hash of values, rather than a single value. That way I can access the fields by name. The top level is still controlled by coordinate name. The changes from the previous version are minor:

    use warnings;
    use strict;
    use feature 'say';    # needed to enable say() below

    open my $data, '<', 'alldata.txt'
        or die "Could not open 'alldata.txt': $!\n";

    my %dataset;

    RECORD:
    while ( my $line = <$data> ) {
        chomp $line;
        next RECORD unless $line =~ m{\d};

        my ( $coord, $dist, $chr, $exons, $pals ) = split /\s+/, $line;

        next RECORD if exists $dataset{$coord}
                   and $dataset{$coord}{dist} < $dist;

        $dataset{$coord} = {
            dist        => $dist,
            chr         => $chr,
            exons       => $exons,
            palindromes => $pals,
        };
    }

    say "coord\tdist\tchr\texons\tpalindromes";
    say "$_\t$dataset{$_}{dist}\t$dataset{$_}{chr}\t$dataset{$_}{exons}\t"
      . "$dataset{$_}{palindromes}"
        for sort { $a <=> $b } keys %dataset;
    • I just cut-and-paste your dataset, rather than converting to comma-separated, so I split on the regexp /\s+/ ... one or more spaces.
    • I had blank lines at the end of the file. Rather than modify the file, I ignore lines that don't contain numbers. It's frequently useful to also skip lines that contain a '#', that way you can have headers and comments in the data file.
    • I create an anonymous hash of all the data other than the coordinate, and then use the coordinate to identify the top-level entries. Note also that you can add more comparison statements; for example, if you have two records tied for distance, you could test for an odd number of exons and do a next RECORD on that (a sketch follows this list).
    • I didn't bother with a format string, too much counting. But I did split the string across two lines, and use a '.' to join them ... I don't like over-long code lines.
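
    For instance, the tie-break mentioned above could look like this inside the RECORD loop (a hedged sketch; the odd-exons rule is only the example from the list):

    if ( exists $dataset{$coord} ) {
        next RECORD if $dataset{$coord}{dist} < $dist;
        # tied on distance: skip the new record when its exon count is odd
        next RECORD if $dataset{$coord}{dist} == $dist and $exons % 2;
    }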

    The entire data structure, according to Data::Dumper:

    $VAR1 = {
        '6700' => {
            'palindromes' => '4',
            'chr' => '13',
            'exons' => '1',
            'dist' => 50
        },
        '4678' => {
            'palindromes' => '0',
            'chr' => '6',
            'exons' => '2',
            'dist' => '45'
        },
        '2346' => {
            'palindromes' => '567',
            'chr' => '12',
            'exons' => '1',
            'dist' => '78'
        },
        '5349' => {
            'palindromes' => '14',
            'chr' => '8',
            'exons' => '2',
            'dist' => '6'
        },
        '3456' => {
            'palindromes' => '5',
            'chr' => '10',
            'exons' => '1',
            'dist' => 67
        },
        '567' => {
            'palindromes' => '8',
            'chr' => '5',
            'exons' => '7',
            'dist' => '344'
        },
        '1345' => {
            'palindromes' => '123',
            'chr' => '5',
            'exons' => '8',
            'dist' => '567'
        },
        '8964' => {
            'palindromes' => '8',
            'chr' => '2',
            'exons' => '18',
            'dist' => '560'
        }
    };

    P.S. The debugger is your friend: perl -d program.pl.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.