Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have this part of a file that I want to match:
CDS join(2432..2501,5144..5154,5746..5760,6411..6446, 7558..7650,8929..8982,11919..11963,12056..12109, 12202..12255,12562..12615,13036..13089,13613..136 +66, 15217..15261,15553..15606,15706..15750,16140..161 +93, 16692..16790,16934..16978,17093..17191,17612..176 +65, 17791..17898,18259..18312,18426..18524,19436..194 +89, 19953..20051,20452..20505,21059..21112,21263..213 +16, 21590..21643,22596..22640,23773..23871,25090..251 +97, 25867..25920,26854..26907,27588..27641,27746..277 +99, 27896..28003,28365..28418,29217..29270,30273..304 +34, 31647..31754,32427..32534,32920..32973,33060..331 +67, 33308..33361,33733..33840,34316..34369,34496..346 +03, 34936..35194,35602..35786,36498..36740,37554..377 +00) /gene="COL1A2" /codon_start=1 /product="pro-alpha 2(I) collagen" /protein_id="AAB93981.1" /db_xref="GI:2735715" /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPA +GDRGPRGER GPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPM +GLMGPRGPP GAAGAPGPQGFQGPAGEPGEPGQTGPAGARGPAGPPGKAGEDGHPGKPG +RPGERGVVG PQGARGFPGTPGLPGFKGIRGHNGLDGLKGQPGAPGVKGEPGAPGENGT +PGQTGARGL PGERGRVGAPGPAGARGSDGSVGPVGPAGPIGSAGPPGFPGAPGPKGEI +GAIGNAGPA GPAGPRGEVGLPGLSGPVGPPGNPGANGLTGAKGAAGLPGVAGAPGLPG +PRGIPGPVG AAGATGARGLVGEPGPAGSKGESGNKGEPGSAGPQGPPGPSGEEGKRGP +NGEAGSAGP PGPPGLRGSPGSRGLPGADGRAGVMGPPGSRGASGPAGVRGPNGDAGRP +GEPGLMGPR GLPGSPGNIGPAGKEGPVGLPGIDGRPGPIGPVGARGEPGNIGFPGPKG +PTGDPGKNG DKGHAGLAGARGAPGPDGNNGAQGPPGPQGVQGGKGEQGPAGPPGFQGL +PGPSGPAGE VGKPGERGLHGEFGLPGPAGPRGERGPPGESGAAGPTGPIGSRGPSGPP +GPDGNKGEP GVVGAVGTAGPSGPSGLPGERGAAGIPGGKGEKGEPGLRGEIGNPGRDG +ARGAHGAVG APGPAGATGDRGEAGAAGPAGPAGPRGSPGERGEVGPAGPNGFAGPAGA +AGQPGAKGE RGGKGPKGENGVVGPTGPVGAAGPAGPNGPPGPAGSRGDGGPPGMTGFP +GAAGRTGPP GPSGISGPPGPPGPAGKEGLRGPRGDQGPVGRTGEVGAVGPPGFAGEKG +PSGEAGTAG PPGTPGPQGLLGAPGILGLPGSRGERGLPGVAGAVGEPGPLGIAGPPGA +RGPPGAVGS PGVNGAPGEAGRDGNPGNDGPPGRDGQPGHKGERGYPGNIGPVGAAGAP +GPHGPVGPA GKHGNRGETGPSGPVGPAGAVGPRGPSGPQGIRGDKGEPGEKGPRGLPG +FKGHNGLQG LPGIAGHHGDQGAPGSVGPAGPRGPAGPSGPAGKDGRTGHPGTVGPAGI +RGPQGHQGP 
AGPPGPPGPPGPPGVSGGGYDFGYDGDFYRADQPRSAPSLRPKDYEVDA +TLKSLNNQI ETLLTPEGSRKNPARTCRDLRLSHPEWSSGYYWIDPNQGCTMEAIKVYC +DFPTGETCI RAQPENIPAKNWYRSSKDKKHVWLGETINAGSQFEYNVEGVTSKEMATQ +LAFMRLLAN YASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYT +VLVDGCSKK TNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK" exon 2432..2501 /gene="COL1A2" /number=1 protein_bind 2487..2500 /gene="COL1A2" /note="putative" /citation=[7] /bound_moiety="NF1" intron 2502..5143 /gene="COL1A2" /citation=[10] /citation=[7] /number=1 protein_bind 3380..3386 /note="putative; bottom strand" /bound_moiety="AP1" protein_bind 3407..3413 /gene="COL1A2" /note="putative" /citation=[7] /bound_moiety="AP1" repeat_region 3716..3747 /citation=[7] /rpt_type=tandem /rpt_unit="gt" exon 5144..5154 /gene="COL1A2" /number=2 intron 5155..5745 /gene="COL1A2" /citation=[10] /number=2 exon 5746..5760 /gene="COL1A2" /number=3

Specifically, I want to match the part starting from /translation=" until the first occurrence of exon, that is, the amino acid sequence. I tried the following but can't get anything...
if($line7=~/^\s+\/translation\=\"(.*)exon/gs) { $amino_acid_seq=$1; } print $amino_acid_seq."\n";

If I just put /s, I get only the first line of amino-acids..

Replies are listed 'Best First'.
Re: Help to build a REGEXP
by kcott (Archbishop) on Mar 12, 2014 at 00:08 UTC

    I'm assuming $line7 contains the excessive amount of data that you've posted. In the script below, I've used a representative sample. For future posts, please do the same.

    You haven't shown how you've extracted that data. Ensure $line7 actually contains the data you think it does (i.e. print "$line7\n";).

    In the script below, I've simply captured everything that isn't a double-quote between '/translation="' and '"' then removed all the extraneous whitespace.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $line7 = '
    ...
                         /db_xref="GI:2735715"
                         /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGER
                         GPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPP
                         YASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKK
                         TNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK"
         exon            2432..2501
    ...
    ';

    my $re = qr{/translation="([^"]+)"};

    my ($extract) = $line7 =~ $re;
    $extract =~ s/\s+//g;

    print $extract;

    Output:

    MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGERGPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPPYASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKKTNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK

    Update: From looking at other posts in this thread, it would seem possible that your initial problem (i.e. before you even start performing any matching) could be extracting the data you want. If that's the case, open a filehandle to your data file and populate $line7 as I've shown below. As you'll see, once you've done that, the rest of the code hasn't changed and the output is identical.

    By the way, is there some significance to the $line7 variable name? If not, I'd pick something more meaningful.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $line7 = '';
    my $re = qr{/translation="([^"]+)"};

    while (<DATA>) {
        if (/^\s+\/translation=/ .. /^\s+exon/) {
            $line7 .= $_;
        }
        else {
            $line7 ? last : next;
        }
    }

    my ($extract) = $line7 =~ $re;
    $extract =~ s/\s+//g;

    print $extract;

    __DATA__
    ...
                         /db_xref="GI:2735715"
                         /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGER
                         GPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPP
                         YASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKK
                         TNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK"
         exon            2432..2501
    ...

    Output:

    MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGERGPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPPYASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKKTNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK

    -- Ken

      Hi, I tried:
      if($_=~ m/^\s+\/translation\=\"(.*?)\"/ms) { $wanted_part=$1; }

      but got nothing! But why doesn't it work?

        It doesn't work because your regex doesn't match whatever is in $_. Of course, as you've refused to advise us what $_ contains, you can't possibly expect any further information on what was happening in that isolated code fragment.

        I provided you with a solution. Your response says you tried something completely different. Why did you reply to my post telling me that?

        Did you try my solution? Did it do what you wanted? If not, what did it do differently? Was it unsuitable for your class exercise? If so, in what way was it unsuitable?

        You've failed to tell us what data you're actually trying to match against: first with $line7 and more recently with $_. Why? I even gave you the specific code in my earlier response " (i.e. print "$line7\n";)". Did you do this? If you did, what was the output? If you didn't, why not?

        You've received a lot of advice from people who've freely given their time to try to help you. I think it's about time you put in some effort yourself: answer questions, provide output, try solutions and so on.

        -- Ken

Re: Help to build a REGEXP (BioPerl)
by Anonymous Monk on Mar 11, 2014 at 23:39 UTC

    Why don't you use one of the BioPerl modules for reading that aminofastawhateveritis?

    Then you don't need to build a regex to parse the mystery $line7 variable, which probably contains only a single line. That would explain why only one line is returned: the other lines aren't in the variable ...

    :)
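The guess above, that $line7 holds only one line, can be illustrated with a minimal sketch. The record below is a made-up, shortened stand-in (not the poster's file), and the in-memory filehandle is only there to keep the example self-contained. It contrasts reading line by line with slurping the whole record before matching:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a GenBank-style record (not the poster's file).
my $record = join '',
    qq{     CDS             join(1..10)\n},
    qq{                     /translation="MLSFVDTRTL\n},
    qq{                     LLLAVTLCLA"\n},
    qq{     exon            1..10\n};

my $re = qr{/translation="([^"]+)"};

# Line by line: each $line holds a single line, so the closing quote on
# the following line is never seen and the regex cannot match.
open my $by_line, '<', \$record or die "Cannot open in-memory file: $!";
my $line_hits = 0;
while ( my $line = <$by_line> ) {
    $line_hits++ if $line =~ $re;
}
close $by_line;

# Slurp mode: with $/ undefined, a single read returns the whole record.
open my $whole, '<', \$record or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$whole> };
close $whole;

my ($seq) = $text =~ $re;
$seq =~ s/\s+//g if defined $seq;

print "line-by-line matches: $line_hits\n";
print "slurped sequence: $seq\n";
```

If the real script reads the file with a plain while (<$fh>) loop, the first case is probably what is happening; printing the variable before matching would confirm it.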

      It's supposed to be for an assignment and we must use REGEXPS...
      The sequence, as you can see, is spread over multiple lines; that's why I tried to catch everything from /translation all the way until the first occurrence of exon in the file....

        It's supposed to be for an assignment and we must use REGEXPS...

        That's akin to being asked to do a gainer off a diving board when just learning to swim. Especially so if you're in bioinformatics. From my experience, it would be more pedagogically sound to first learn to proficiently wield the (BioPerl) tools, then learn how to forge such tools...

        If you must, however, use a regex in your script, perhaps the following will be helpful:

        use strict;
        use warnings;
        use Bio::SeqIO;

        my $filename = 'sequences.gen';
        my $stream = Bio::SeqIO->new(
            -file   => $filename,
            -format => 'GenBank'
        );

        while ( my $seq = $stream->next_seq() ) {
            my $trans = $seq->translate();
            print $trans->seq(), "\n";
        }

        my $string = 'This script uses a regex.';
        $string =~ s/uses/doesn't use/;
        print $string;
        Still, this doesn't seem to work...
        if($line7=~/^\s+\/translation\=\"(.*?)\"/s) {$amino_acid_seq=$1;}

      Also, once you get more than one line into $line7, you want non-greedy matching .*? as there are multiple "exon" strings
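A small sketch of the greedy versus non-greedy difference mentioned above. The input string is a made-up, shortened record with two "exon" features, not the original data:

```perl
use strict;
use warnings;

# Hypothetical, shortened record with two "exon" features.
my $text = qq{/translation="MLSF\n  VDTR" exon 1..10 /number=1 exon 20..30 /number=2};

# Greedy: .* runs to the end of the string, then backtracks to the LAST "exon".
my ($greedy) = $text =~ m{/translation="(.*)exon}s;

# Non-greedy: .*? stops at the FIRST "exon".
my ($lazy) = $text =~ m{/translation="(.*?)exon}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture drags in the first exon feature as well, while the non-greedy one ends just before it. Both still contain the closing quote and the line break, so some cleanup would be needed afterwards.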

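      Also, you don't want to use m//g in scalar context (such as inside an if): there it finds only one match per call and remembers the position for the next attempt, so the /g gains you nothing here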
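To show what m//g does in scalar context, here is a minimal sketch on a made-up string. It is only an illustration of the general behaviour, not a diagnosis of the original script:

```perl
use strict;
use warnings;

my $text = 'exon 1 exon 2';    # hypothetical input

# In scalar context, m//g finds one match per call and records where it
# stopped in pos($text); the next m//g on the same string resumes there.
my @seen;
if ( $text =~ /exon (\d)/g ) { push @seen, $1 }    # matches "exon 1"
if ( $text =~ /exon (\d)/g ) { push @seen, $1 }    # resumes, matches "exon 2"
my $third = ( $text =~ /exon (\d)/g ) ? 1 : 0;     # nothing left; pos is reset

# Without /g every attempt starts from the beginning of the string.
my ($first) = $text =~ /exon (\d)/;

print "with /g: @seen, third attempt matched: $third\n";
print "without /g: $first\n";
```

So an if with /g can silently start from the middle of a string that was already matched once, which is rarely what is wanted for a single extraction.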

      Also, perlrequick is a great quick reference :)