select strings with biggest length

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: select strings with biggest length
by davido (Cardinal) on Nov 19, 2006 at 17:37 UTC

I'd use a hash like this (code is documented with inline comments where necessary):

use strict;
use warnings;
use Data::Dumper;

my %sequences;    # This hash will hold sequence ID's and the
                  # associated longest values.

while( defined( my $id = <DATA> ) ) {    # Read the ID from file.
    defined( my $sequence = <DATA> )     # Read the sequence.
        or die "Odd number of lines in DATA.\n";
    chomp for ( $id, $sequence );        # Chomp the input.
    $id =~ s/^>//;                       # Strip the > character
                                         # from the ID.
    if(
        exists( $sequences{ $id } ) and 
        length( $sequences{ $id } ) >= length( $sequence ) 
    ) {
        # Current sequence not longer. Skip to next record.
        next;
    } else {
        # Current sequence is longer.  Keep track of it.
        $sequences{ $id } = $sequence;
    }
}

#  Now %sequences contains all the longest strings for each ID.
#  Print the hash...
print Dumper \%sequences;

__DATA__
>protein1
ASFGTHTRHTHRHTHTRHTRHTR
>protein2
ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE
>protein3
AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR
>protein2
AASEFEFEFE
>protein4
REYTRHTRGRVEVCREVR

Dave

Re: select strings with biggest length
by grep (Monsignor) on Nov 19, 2006 at 17:52 UTC

%protein = (
    protein1 => 'ASFGTHTRHTHRHTHTRHTRHTR',
    protein2 => 'ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE',
    #...
);

length

use strict;
use warnings;
use Data::Dumper;

my %protein = ();
my $key = '';
foreach my $line (<DATA>) {
    chomp($line);
    
    # Get the key if it's a key line then skip to the next line
    if ($line =~ /^>protein/) {
       $key = $line;
       next;  
    }
    
    if ($key and $line) {   # So this is the protein 
        if (exists($protein{$key})) {  # Have we seen it before
            # Test the length and assign if greater
            $protein{$key} = $line
                if length($protein{$key}) < length($line);
        } else {
            # We haven't seen it before so just assign
            $protein{$key} = $line;
        }
        $key = '';  # Reset Key
    }
}
print Dumper \%protein;

__DATA__
>protein1
ASFGTHTRHTHRHTHTRHTRHTR
>protein2
ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE
>protein3
AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR
>protein2
AASEFEFEFE
>protein4
REYTRHTRGRVEVCREVR

grep

XP matters not. Look at me. Judge me by my XP, do you?

Re: select strings with biggest length
by johngg (Canon) on Nov 19, 2006 at 18:26 UTC

$_

split

map

sort

map

use strict;
use warnings;

use Data::Dumper;

{
    local $/;
    $_ = <DATA>;
}

my %sequences =
   map  { $_->[0], $_->[1] }
   sort { length $a->[1] <=> length $b->[1] }
   map  { [split m{\n}] }
   m    { > ( [^\n]+ \n [^\n]+ ) }gx;

print Dumper(\%sequences);

__END__
>protein1
ASFGTHTRHTHRHTHTRHTRHTR
>protein2
ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE
>protein3
AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR
>protein2
AASEFEFEFE
>protein4
REYTRHTRGRVEVCREVR

Here's the output:

$VAR1 = {
          'protein4' => 'REYTRHTRGRVEVCREVR',
          'protein2' => 'ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE',
          'protein1' => 'ASFGTHTRHTHRHTHTRHTRHTR',
          'protein3' => 'AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR'
        };

I hope this is of use.

Cheers,

JohnGG

Update: Used the x modifier and spaced things out a bit to aid readability.

Re: select strings with biggest length
by ambrus (Abbot) on Nov 19, 2006 at 18:04 UTC

<proteins.txt sed 'N;s/\n/ /' | perl -wpe 'print length, " ";' | sort -nr | sort -suk2,2 | cut -d\  -f2- | sed 's/ /\n/' >output.txt

This assumes that the protein names don't contain whitespace, that the order of the output doesn't matter, and that it doesn't matter which of two equal-length sequences you choose. It might also need GNU sed. Any of these could be fixed easily.
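
For comparison, here is a rough Perl sketch of one way those assumptions might be lifted: names may contain whitespace, output follows the order in which each name first appears, and the first of two equal-length sequences wins. The helper name longest_per_id and the sample data are only illustrative, not something from the pipeline above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes lines in ">id" / sequence pairs and returns
# [ id, sequence ] pairs holding the longest sequence per id, in order
# of first appearance.
sub longest_per_id {
    my @lines = @_;
    my ( %longest, @order );
    while (@lines) {
        my $id       = shift @lines;
        my $sequence = shift @lines;
        die "Odd number of lines in input.\n" unless defined $sequence;
        chomp( $id, $sequence );
        $id =~ s/^>//;                       # Strip the leading '>'.
        if ( !exists $longest{$id} ) {
            push @order, $id;                # Remember first-seen order.
            $longest{$id} = $sequence;
        }
        elsif ( length($sequence) > length( $longest{$id} ) ) {
            # Strictly longer only, so the first of a tie is kept.
            $longest{$id} = $sequence;
        }
    }
    return map { [ $_, $longest{$_} ] } @order;
}

# Sample records taken from the thread's test data.
my @sample = (
    '>protein1', 'ASFGTHTRHTHRHTHTRHTRHTR',
    '>protein2', 'ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE',
    '>protein2', 'AASEFEFEFE',
);

print ">$_->[0]\n$_->[1]\n" for longest_per_id(@sample);
```

To run it against a real file you could pass the file's lines instead of @sample, for example longest_per_id(<STDIN>) with the file redirected on standard input.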

Re: select strings with biggest length
by talexb (Chancellor) on Nov 19, 2006 at 17:20 UTC

Sounds like you've got a good handle on the algorithm. What Perl code have you written so far to solve this problem?

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
