Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,
I have a file which has the following format:
>name_of_protein
sequence_of_protein
Now, tha thing is that the same protein might appear more than once, say you have:
>protein1
ASFGTHTRHTHRHTHTRHTRHTR
>protein2
ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE
>protein3
AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR
>protein2
AASEFEFEFE
>protein4
>REYTRHTRGRVEVCREVR
What I need to do is: Read the file, when you see a protein (e.g protein1), store the length of the sequence(the line below) and then, scan through the rest of the document and, if you find the same protein name, compare the length of the sequence you found previously with the one you found later and store the initial protein name along with the sequence which has the biggest length.
In my example, the hash should store protein2 with the sequence it finds in the first place, because it is bigger.

Replies are listed 'Best First'.
Re: select strings with biggest length
by davido (Cardinal) on Nov 19, 2006 at 17:37 UTC

    I'd use a hash like this (code is documented with inline comments where necessary):

    use strict; use warnings; use Data::Dumper; my %sequences; # This hash will hold sequence ID's and the # associated longest values. while( defined( my $id = <DATA> ) ) { # Read the ID from file. defined( my $sequence = <DATA> ) # Read the sequence. or die "Odd number of lines in DATA.\n"; chomp for ( $id, $sequence ); # Chomp the input. $id =~ s/^>//; # Strip the > character # from the ID. if( exists( $sequences{ $id } ) and length( $sequences{ $id } ) >= length( $sequence ) ) { # Current sequence not longer. Skip to next record. next; } else { # Current sequence is longer. Keep track of it. $sequences{ $id } = $sequence; } } # Now %sequences contains all the longest strings for each ID. # Print the hash... print Dumper \%sequences; __DATA__ >protein1 ASFGTHTRHTHRHTHTRHTRHTR >protein2 ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE >protein3 AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR >protein2 AASEFEFEFE >protein4 REYTRHTRGRVEVCREVR

    Dave

Re: select strings with biggest length
by grep (Monsignor) on Nov 19, 2006 at 17:52 UTC
    If I'm following you, I would fashion a data structure like this:
    %protein = ( protein1 => 'ASFGTHTRHTHRHTHTRHTRHTR', protein2 => 'ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE', #... );
    Then it's just a matter of looping over the file, looking for an existing key, and running length. Similar to this:
    use strict; use warnings; use Data::Dumper; my %protein = (); my $key = ''; foreach my $line (<DATA>) { chomp($line); # Get the key if it's a key line then skip to the next line if ($line =~ /^>protein/) { $key = $line; next; } if ($key and $line) { # So this is the protein if (exists($protein{$key})) { # Have we seen it before # Test the length and assign if greater $protein{$key} = $line if ( length($protein{$key}) < lengt +h($line) ); } else { # We haven't seen it before so just assign $protein{$key} = $line; } $key = ''; # Reset Key } } print Dumper \%protein __DATA__ >protein1 ASFGTHTRHTHRHTHTRHTRHTR >protein2 ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE >protein3 AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR >protein2 AASEFEFEFE >protein4 REYTRHTRGRVEVCREVR

    grep
    XP matters not. Look at me. Judge me by my XP, do you?

Re: select strings with biggest length
by johngg (Canon) on Nov 19, 2006 at 18:26 UTC
    This may not be a good way to process the sequences, especially if your file is large. However, just to show another way, you can read the whole file into $_ and pull the protein name/sequence pairs out using a regular expression. Then split them on newline in a map, pass into a sort so as to get the pairs in ascending sequence length and finally map the pairs out and assign to a hash. Thus, if there are any duplicate proteins the longest sequence will be assigned to the hash last, overwriting any previous shorter sequence.

    use strict; use warnings; use Data::Dumper; { local $/; $_ = <DATA>; } my %sequences = map { $_->[0], $_->[1] } sort { length $a->[1] <=> length $b->[1] } map { [split m{\n}] } m { > ( [^\n]+ \n [^\n]+ ) }gx; print Dumper(\%sequences); __END__ >protein1 ASFGTHTRHTHRHTHTRHTRHTR >protein2 ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE >protein3 AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR >protein2 AASEFEFEFE >protein4 REYTRHTRGRVEVCREVR

    Here's the output

    $VAR1 = { 'protein4' => 'REYTRHTRGRVEVCREVR', 'protein2' => 'ERYRYTRYHTRHTGEFEWWFEEFFFFREFRGRE', 'protein1' => 'ASFGTHTRHTHRHTHTRHTRHTR', 'protein3' => 'AWEERERGRGRGREGRGREGRRRRRRRRTTHTHTRHRHTRHTR' };

    I hope this is of use.

    Cheers,

    JohnGG

    Update: Used the x modifier and spaced things out a bit to aid readability.

Re: select strings with biggest length
by ambrus (Abbot) on Nov 19, 2006 at 18:04 UTC
    <proteins.txt sed 'N;s/\n/ /' | perl -wpe 'print length, " ";' | sort +-nr | sort -suk2,2 | cut -d\ -f2- | sed 's/ /\n/' >output.txt

    This assumes that the protein names don't contain whitespace, that the order of the output doesn't matter, that it doesn't matter which one of two sequences of equal lengths you choose, and it might also need gnu sed. Any of these could be fixed easily.

Re: select strings with biggest length
by talexb (Chancellor) on Nov 19, 2006 at 17:20 UTC

    Sounds like you've got a good handle on the algorithim. What Perl code have you written so far to solve this problem?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds