Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have an array containing the following 3 elements:
@array = ('>143B_HUMAN (P31946) 14-3-3 protein beta/alpha (Protein kinase C inhibitor protein-1) (KCIP-1) (Protein 1054)
TMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSW
RVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYL
KMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYY
EILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDA
GEGEN',
'>AAAT_HUMAN (Q15758) Neutral amino acid transporter B(0) (ATB(0)) (Sodium-dependent neutral amino acid transporter type 2) (RD114/simian type D retrovirus receptor) (Baboon M7 virus receptor)
MVADPPRDSKGLAAAEPTANGGLALASIEDQGAAAGGYCGSRDQVRRCLRANLLVLLTVV
AVVAGVALGLGVSGAGGALALGPERLSAFVFPGELLLRLLRMIILPLVVCSLIGGAASLD
PGALGRLGAWALLFFLVTTLLASALGVGLALALQPGAASAAINASVGAAGSAENAPSKEV
LDSFLDLARNIFPSNLVSAAFRSYSTTYEERNITGTRVKVPVGQEVEGMNILGLVVFAIV
FGVALRKLGPEGELLIRFFNSFNEATMVLVSWIMWYAPVGIMFLVAGKIVEMEDVGLLFA
RLGKYILCCLLGHAIHGLLVLPLIYFLFTRKNPYRFLWGIVTPLATAFGTSSSSATLPLM
MKCVEENNGVAKHISRFILPIGATVNMDGAALFQCVAAVFIAQLSQQSLDFVKIITILVT
ATASSVGAAGIPAGGVLTLAIILEAVNLPVDHISLILAVDWLVDRSCTVLNVEGDALGAG
LLQNYVDRTESRSTEPELIQVKSELPLDPLPVPTEEGNPLLKHYRGPAGDATVASEKESV
M',
'>143E_HUMAN (P42655) 14-3-3 protein epsilon (Mitochondrial import stimulation factor L subunit) (Protein kinase C inhibitor protein-1) (KCIP-1) (14-3-3E)
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ');
From each element, I only want to extract the sequence of capital letters, i.e. MERYSHIBDOCLDPKSD..... I have tried using a foreach loop to apply the substr function to each element, with no luck. Any help would be appreciated. Thanks

Replies are listed 'Best First'.
Re: string manipulation
by robartes (Priest) on Nov 19, 2002 at 11:07 UTC
    Hi,

    you can match any sequence of capital letters with the following simple regexp:

    /[A-Z]+/
    This matches one or more capitals. If you want to access the matched string, put parentheses around the pattern:
    /([A-Z]+)/
    and then the matched string will be in $1.

    Have a look at perlre for more information on regular expressions.
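As a hedged illustration of the capture: on strings shaped like the ones in the question, an unanchored /([A-Z]+)/ stops at the first run of capitals, which sits in the header. A sketch that also anchors the match at the end of the string may be closer to what was asked. The record in `$record` is a shortened, made-up stand-in, and the anchored pattern assumes the header ends in a closing paren and the sequence runs to the end of the string.

```perl
use strict;
use warnings;

# Hypothetical, shortened record in the same shape as the question's data.
my $record = ">143B_HUMAN (P31946) 14-3-3 protein beta/alpha (KCIP-1)\n"
           . "TMDKSELVQK\nAKLAEQAERY";

# Unanchored: the leftmost run of capitals wins, and that is in the header.
my ($first) = $record =~ /([A-Z]+)/;

# Anchored at the end of the string: capitals and whitespace up to \z.
my ($tail) = $record =~ /([A-Z\s]+)\z/;
(my $sequence = $tail) =~ s/\s+//g;    # drop line breaks and spaces

print "first run: $first\n";      # first run: B
print "sequence:  $sequence\n";   # sequence:  TMDKSELVQKAKLAEQAERY
```

The anchored form would pick up trailing capitals from the header if the header did not end in a non-letter, so it is only as reliable as that assumption.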

    CU
    Robartes-

Re: string manipulation
by dingus (Friar) on Nov 19, 2002 at 11:46 UTC
    There are a few ways to do this. It is not clear to me whether you want the line breaks or not. I assume NOT.
    for (@array) {
        my $extract = substr($_, rindex($_, ')') + 1);  # everything after the last paren is interesting
        $extract =~ s/\s+//gs;                           # remove spaces
        print "$extract\n";                              # or do something else
    }
    Note that the code does not check whether the string contains any )s, and so does not skip elements that have none.
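A minimal sketch of the missing check, assuming an element without a closing paren should simply be skipped: rindex returns -1 when the character is not found, so the result can be tested before calling substr. The sub name `extract_sequence` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text after the last ')' with all
# whitespace removed, or undef (in scalar context) when there is no ')'.
sub extract_sequence {
    my ($entry) = @_;
    my $pos = rindex($entry, ')');    # -1 when there is no ')'
    return if $pos < 0;
    my $extract = substr($entry, $pos + 1);
    $extract =~ s/\s+//g;
    return $extract;
}

# Made-up sample data: the second element has no paren and is skipped.
my @sample = (
    ">ID1 (P00001) some protein (X-1)\nMKVL\nAAGT",
    "no header here",
);

for my $entry (@sample) {
    my $sequence = extract_sequence($entry);
    next unless defined $sequence;
    print "$sequence\n";
}
```

Without the check, rindex's -1 plus 1 gives offset 0, so substr would hand back the whole string, header included.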

    Dingus


    Enter any 47-digit prime number to continue.
Re: string manipulation
by Thelonius (Priest) on Nov 19, 2002 at 14:04 UTC
    for (@array) {
        if (/([A-Z\n]+)$/) {
            my $sequence = $1;
            $sequence =~ s/\n//g;
            print $sequence, "\n"; # or whatever you want to do with it;
        }
    }
Re: string manipulation
by pg (Canon) on Nov 19, 2002 at 15:45 UTC
    I am not trying to be rude or picky, but to be frank, I had to look at the screen really closely to see where each of the three elements started. I understand that part of the problem is that you had to post it within this small space. How do you actually store this data in your production code? Do you have an easy-to-understand, easy-to-maintain, and visually nice way? Monks, any suggestions? How would you write this with some beauty?
      It looks like he's got more than one piece of information per string. Much better to split that out into a hash and have an array of hashes. It's even cooler cause it's now self-documenting. :-)
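A rough sketch of the array-of-hashes idea, with heavy assumptions: the field names (id, accession, description, sequence) and the sub name `parse_entry` are made up, and the pattern assumes each header looks like ">ID (ACCESSION) description" and ends in a closing paren, with the sequence following it.

```perl
use strict;
use warnings;

# Hypothetical parser: splits one entry into header fields and sequence.
# Returns a hash reference, or nothing when the entry does not fit.
sub parse_entry {
    my ($entry) = @_;
    my $pos = rindex($entry, ')');          # end of the header
    return if $pos < 0;
    my $header   = substr($entry, 0, $pos + 1);
    my $sequence = substr($entry, $pos + 1);
    $sequence =~ s/\s+//g;                  # drop line breaks and spaces
    my ($id, $accession, $description) =
        $header =~ /^>(\S+)\s+\(([^)]+)\)\s*(.*)$/s
        or return;
    return {
        id          => $id,
        accession   => $accession,
        description => $description,
        sequence    => $sequence,
    };
}

# Made-up, shortened sample in the shape of the question's data.
my @sample = (
    ">143B_HUMAN (P31946) 14-3-3 protein beta/alpha (KCIP-1)\nTMDKSELVQK\nAKLAEQAERY",
);

my @proteins = map { parse_entry($_) } @sample;
print "$_->{id} ($_->{accession}): $_->{sequence}\n" for @proteins;
```

Each record then names its own parts, so later code can ask for `$protein->{sequence}` instead of re-parsing the string.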

      I thought the same thing, but then concluded that the data is probably read in from a file rather than hardcoded into a program and was only posted that way for the purposes of asking the question.
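If the data does come from a file, something along these lines might avoid building the long strings at all. This is only a sketch: the text in `$text` is made up, an in-memory filehandle stands in for the real file, and it assumes every record starts with a header line beginning with ">".

```perl
use strict;
use warnings;

# Made-up FASTA-style text standing in for the contents of a real file.
my $text = ">ID1 (P00001) first protein\nMKVL\nAAGT\n"
         . ">ID2 (P00002) second protein\nTTSQ\nLLPA\n";

# An in-memory filehandle keeps the sketch self-contained; with a real
# file, open its path here instead.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @sequences;
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^>/) {          # header line starts a new record
        push @sequences, '';
    }
    elsif (@sequences) {          # sequence line: append to current record
        $sequences[-1] .= $line;
    }
}
close $fh;

print "$_\n" for @sequences;
```

Reading line by line means the header never gets mixed into the sequence, so no regexp is needed to separate them afterwards.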

      If I really had to embed such long lines of text into a script, I'd probably use something like

      @array = (
          '>143B_HUMAN (P31946) 14-3-3 protein beta/alpha ' .
          '(Protein kinase C inhibitor protein-1) (KCIP-1) (Protein 1054) ' .
          'TMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWR' .
          'VISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKM' .
          'KGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEIL' .
          'NSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN',
          # ...
      );

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller