Help to build a REGEXP

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have this part of a file that I want to match:

     CDS             join(2432..2501,5144..5154,5746..5760,6411..6446,
                     7558..7650,8929..8982,11919..11963,12056..12109,
                     12202..12255,12562..12615,13036..13089,13613..13666,
                     15217..15261,15553..15606,15706..15750,16140..16193,
                     16692..16790,16934..16978,17093..17191,17612..17665,
                     17791..17898,18259..18312,18426..18524,19436..19489,
                     19953..20051,20452..20505,21059..21112,21263..21316,
                     21590..21643,22596..22640,23773..23871,25090..25197,
                     25867..25920,26854..26907,27588..27641,27746..27799,
                     27896..28003,28365..28418,29217..29270,30273..30434,
                     31647..31754,32427..32534,32920..32973,33060..33167,
                     33308..33361,33733..33840,34316..34369,34496..34603,
                     34936..35194,35602..35786,36498..36740,37554..37700)
                     /gene="COL1A2"
                     /codon_start=1
                     /product="pro-alpha 2(I) collagen"
                     /protein_id="AAB93981.1"
                     /db_xref="GI:2735715"
                     /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGER
                     GPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPP
                     GAAGAPGPQGFQGPAGEPGEPGQTGPAGARGPAGPPGKAGEDGHPGKPGRPGERGVVG
                     PQGARGFPGTPGLPGFKGIRGHNGLDGLKGQPGAPGVKGEPGAPGENGTPGQTGARGL
                     PGERGRVGAPGPAGARGSDGSVGPVGPAGPIGSAGPPGFPGAPGPKGEIGAIGNAGPA
                     GPAGPRGEVGLPGLSGPVGPPGNPGANGLTGAKGAAGLPGVAGAPGLPGPRGIPGPVG
                     AAGATGARGLVGEPGPAGSKGESGNKGEPGSAGPQGPPGPSGEEGKRGPNGEAGSAGP
                     PGPPGLRGSPGSRGLPGADGRAGVMGPPGSRGASGPAGVRGPNGDAGRPGEPGLMGPR
                     GLPGSPGNIGPAGKEGPVGLPGIDGRPGPIGPVGARGEPGNIGFPGPKGPTGDPGKNG
                     DKGHAGLAGARGAPGPDGNNGAQGPPGPQGVQGGKGEQGPAGPPGFQGLPGPSGPAGE
                     VGKPGERGLHGEFGLPGPAGPRGERGPPGESGAAGPTGPIGSRGPSGPPGPDGNKGEP
                     GVVGAVGTAGPSGPSGLPGERGAAGIPGGKGEKGEPGLRGEIGNPGRDGARGAHGAVG
                     APGPAGATGDRGEAGAAGPAGPAGPRGSPGERGEVGPAGPNGFAGPAGAAGQPGAKGE
                     RGGKGPKGENGVVGPTGPVGAAGPAGPNGPPGPAGSRGDGGPPGMTGFPGAAGRTGPP
                     GPSGISGPPGPPGPAGKEGLRGPRGDQGPVGRTGEVGAVGPPGFAGEKGPSGEAGTAG
                     PPGTPGPQGLLGAPGILGLPGSRGERGLPGVAGAVGEPGPLGIAGPPGARGPPGAVGS
                     PGVNGAPGEAGRDGNPGNDGPPGRDGQPGHKGERGYPGNIGPVGAAGAPGPHGPVGPA
                     GKHGNRGETGPSGPVGPAGAVGPRGPSGPQGIRGDKGEPGEKGPRGLPGFKGHNGLQG
                     LPGIAGHHGDQGAPGSVGPAGPRGPAGPSGPAGKDGRTGHPGTVGPAGIRGPQGHQGP
                     AGPPGPPGPPGPPGVSGGGYDFGYDGDFYRADQPRSAPSLRPKDYEVDATLKSLNNQI
                     ETLLTPEGSRKNPARTCRDLRLSHPEWSSGYYWIDPNQGCTMEAIKVYCDFPTGETCI
                     RAQPENIPAKNWYRSSKDKKHVWLGETINAGSQFEYNVEGVTSKEMATQLAFMRLLAN
                     YASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKK
                     TNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK"
     exon            2432..2501
                     /gene="COL1A2"
                     /number=1
     protein_bind    2487..2500
                     /gene="COL1A2"
                     /note="putative"
                     /citation=[7]
                     /bound_moiety="NF1"
     intron          2502..5143
                     /gene="COL1A2"
                     /citation=[10]
                     /citation=[7]
                     /number=1
     protein_bind    3380..3386
                     /note="putative; bottom strand"
                     /bound_moiety="AP1"
     protein_bind    3407..3413
                     /gene="COL1A2"
                     /note="putative"
                     /citation=[7]
                     /bound_moiety="AP1"
     repeat_region   3716..3747
                     /citation=[7]
                     /rpt_type=tandem
                     /rpt_unit="gt"
     exon            5144..5154
                     /gene="COL1A2"
                     /number=2
     intron          5155..5745
                     /gene="COL1A2"
                     /citation=[10]
                     /number=2
     exon            5746..5760
                     /gene="COL1A2"
                     /number=3

Specifically, I want to match the part starting from /translation=" until the first occurrence of exon, that is, the amino acid sequence. I tried the following but can't get anything...

if($line7=~/^\s+\/translation\=\"(.*)exon/gs)
{
    $amino_acid_seq=$1;
}
print $amino_acid_seq."\n";

If I use only /s, I get just the first line of amino acids.

Replies are listed 'Best First'.
Re: Help to build a REGEXP by kcott (Archbishop) on Mar 12, 2014 at 00:08 UTC
I'm assuming `$line7` contains the excessive amount of data that you've posted. In the script below, I've used a representative sample. For future posts, please do the same.

You haven't shown how you've extracted that data. Ensure `$line7` actually contains the data you think it does (i.e. `print "$line7\n";`).

In the script below, I've simply captured everything that isn't a double-quote between '`/translation="`' and '`"`' then removed all the extraneous whitespace.

#!/usr/bin/env perl -l
use strict;
use warnings;

my $line7 = '
...
                     /db_xref="GI:2735715"
                     /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGER
                     GPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPP
                     YASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKK
                     TNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK"
     exon            2432..2501
...
';

my $re = qr{/translation="([^"]+)"};

my ($extract) = $line7 =~ $re;
$extract =~ s/\s+//g;
print $extract;

Output:

MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGERGPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPPYASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKKTNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK

Update: From looking at other posts in this thread, it would seem possible that your initial problem (i.e. before you even start performing any matching) could be extracting the data you want. If that's the case, open a filehandle to your data file and populate `$line7` as I've shown below. As you'll see, once you've done that, the rest of the code hasn't changed and the output is identical. By the way, is there some significance to the `$line7` variable name? If not, I'd pick something more meaningful.

#!/usr/bin/env perl -l
use strict;
use warnings;

my $line7 = '';
my $re = qr{/translation="([^"]+)"};

while (<DATA>) {
    if (/^\s+\/translation=/ .. /^\s+exon/) {
        $line7 .= $_;
    }
    else {
        $line7 ? last : next;
    }
}

my ($extract) = $line7 =~ $re;
$extract =~ s/\s+//g;
print $extract;

__DATA__
...
                     /db_xref="GI:2735715"
                     /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGER
                     GPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPP
                     YASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKK
                     TNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK"
     exon            2432..2501
...

Output:

MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGERGPPGPPGRDGEDGPTGPPGPPGPPGPPGLGGNFAAQYDGKGVGLGPGPMGLMGPRGPPYASQNITYHCKNSIAYMDEETGNLKKAVILQGSNDVELVAEGNSRFTYTVLVDGCSKKTNEWGKTIIEYKTNKPSRLPFLDIAPLDIGGADHEFFVDIGPVCFK

-- Ken
Re^2: Help to build a REGEXP by Anonymous Monk on Mar 12, 2014 at 10:30 UTC
Hi, I tried: `if($_=~ m/^\s+\/translation\=\"(.*?)\"/ms) { $wanted_part=$1; }` but got nothing! But why doesn't it work?
Re^3: Help to build a REGEXP by Anonymous Monk on Mar 12, 2014 at 10:57 UTC
What is inside $_? Did you ddumper it (see the Basic debugging checklist) to see what's inside, as item 4 of that checklist teaches? Did you `use re 'debug';` to see what gets matched? If you look at Re^4: Help to build a REGEXP (m//ms) you can see me show you all these things.
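As a rough illustration of the dumping advice above, the sketch below prints a variable with the core Data::Dumper module so that newlines show up as visible escapes. The variable `$candidate` and its contents are hypothetical stand-ins; the thread never shows what `$_` actually held.

```perl
use strict;
use warnings;
use Data::Dumper;

# Print control characters as escapes (\n, \t) rather than raw bytes.
$Data::Dumper::Useqq = 1;

# Hypothetical stand-in for whatever the variable holds at match time.
# Here it is a single input line, which is what a line-by-line read yields.
my $candidate = qq{                     /translation="MLSFVDTRTLLLLAVTLCLATCQSLQEETVRKGPAGDRGPRGER\n};

my $dump = Dumper($candidate);
print $dump;

# Count the newlines: a count of 1 means the variable holds one line only,
# so a pattern that needs the closing quote several lines later cannot match.
my $newlines = () = $candidate =~ /\n/g;
print "newlines: $newlines\n";
```

If the dump shows only one line, the regex is not the first thing to fix; the way the data is read is.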
Re^4: Help to build a REGEXP by Anonymous Monk on Mar 12, 2014 at 11:15 UTC
Re^5: Help to build a REGEXP by Anonymous Monk on Mar 12, 2014 at 11:46 UTC
Re^3: Help to build a REGEXP by kcott (Archbishop) on Mar 12, 2014 at 20:18 UTC
It doesn't work because your regex doesn't match whatever is in `$_`. Of course, as you've refused to advise us what `$_` contains, you can't possibly expect any further information on what was happening in that isolated code fragment.

I provided you with a solution. Your response says you tried something completely different. Why did you reply to my post telling me that? Did you try my solution? Did it do what you wanted? If not, what did it do differently? Was it unsuitable for your class exercise? If so, in what way was it unsuitable?

You've failed to tell us what data you're actually trying to match against: first with `$line7` and more recently with `$_`. Why? I even gave you the specific code in my earlier response "(i.e. `print "$line7\n";`)". Did you do this? If you did, what was the output? If you didn't, why not?

You've received a lot of advice from people who've freely given their time to try to help you. I think it's about time you put in some effort yourself: answer questions, provide output, try solutions and so on.

-- Ken
Re: Help to build a REGEXP (BioPerl) by Anonymous Monk on Mar 11, 2014 at 23:39 UTC
Why don't you use one of the BioPerl modules for reading that aminofastawhateveritis? Then you don't need to build a regex to parse the mystery $line7 variable, which probably contains only a single line, so that's all that is returned, because the other lines aren't in the variable ... :)
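The single-line hypothesis above can be sketched as follows. This is only an illustration under assumptions: the three-line excerpt and all variable names are made up, and an in-memory filehandle stands in for the real file, which the thread never names.

```perl
use strict;
use warnings;

# Hypothetical three-line excerpt standing in for the real GenBank file.
my $file_contents = <<'END';
                     /translation="MLSFVD
                     TNEWGK"
     exon            2432..2501
END

my $re = qr{/translation="([^"]+)"};

# Line-by-line read: one read returns only the first line, which has an
# opening quote but no closing quote.
open my $by_line, '<', \$file_contents or die "open failed: $!";
my $first_line = <$by_line>;
close $by_line;

# Slurp: with $/ undefined, one read returns the whole file.
open my $whole, '<', \$file_contents or die "open failed: $!";
my $everything = do { local $/; <$whole> };
close $whole;

my ($from_line)  = $first_line =~ $re;
my ($from_slurp) = $everything =~ $re;

print "single line: ", (defined $from_line  ? "matched" : "no match"), "\n";
print "whole file:  ", (defined $from_slurp ? "matched" : "no match"), "\n";
```

The same regex fails on one line and succeeds on the slurped text, which is consistent with the symptom of getting nothing or only the first line.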
Re^2: Help to build a REGEXP (BioPerl) by Anonymous Monk on Mar 11, 2014 at 23:43 UTC
It's supposed to be for an assignment and we must use REGEXPS... The sequence, as you can see, is spread over multiple lines; that's why I tried to catch everything from `/translation` all the way until the first occurrence of `exon` in the file....
Re^3: Help to build a REGEXP (BioPerl) by Kenosis (Priest) on Mar 11, 2014 at 23:59 UTC
"It's supposed to be for an assignment and we must use REGEXPS..."

That's akin to being asked to do a gainer off a diving board when just learning to swim. Especially so if you're in bioinformatics. From my experience, it would be more pedagogically sound to first learn to proficiently wield the (BioPerl) tools, then learn how to forge such tools...

If you must, however, use a regex in your script, perhaps the following will be helpful:

use strict;
use warnings;
use Bio::SeqIO;

my $filename = 'sequences.gen';
my $stream = Bio::SeqIO->new( -file => $filename, -format => 'GenBank' );

while ( my $seq = $stream->next_seq() ) {
    my $trans = $seq->translate();
    print $trans->seq(), "\n";
}

my $string = 'This script uses a regex.';
$string =~ s/uses/doesn't use/;
print $string;
Re^4: Help to build a REGEXP (BioPerl) by erix (Prior) on Mar 12, 2014 at 00:20 UTC
Re^5: Help to build a REGEXP (BioPerl) by erix (Prior) on Mar 12, 2014 at 00:42 UTC
Re^5: Help to build a REGEXP (BioPerl) by Kenosis (Priest) on Mar 12, 2014 at 00:32 UTC
Re^3: Help to build a REGEXP (BioPerl) by Anonymous Monk on Mar 11, 2014 at 23:47 UTC
Still, this doesn't seem to work... `if($line7=~/^\s+\/translation\=\"(.*?)\"/s) {$amino_acid_seq=$1;}`
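One possible reason a pattern like the one above finds nothing, even when the whole record is in the variable, is that `/s` changes only what `.` matches; without `/m`, `^` anchors only at the very start of the string. The sketch below assumes a made-up excerpt held in a hypothetical `$record` and compares the two modifier sets.

```perl
use strict;
use warnings;

# Hypothetical multi-line excerpt; assumes the whole record is in one scalar.
my $record = <<'END';
     CDS             join(2432..2501,5144..5154)
                     /translation="MLSFVD
                     TNEWGK"
     exon            2432..2501
END

# /s alone: '.' may cross newlines, but '^' anchors only at the start of
# the string, where the text is "     CDS", so this cannot match.
my ($s_only) = $record =~ /^\s+\/translation="(.*?)"/s;

# /m added: '^' may also anchor just after any newline, so the match can
# begin on the /translation line.
my ($s_and_m) = $record =~ /^\s+\/translation="(.*?)"/ms;

# Strip line breaks and indentation from the captured sequence.
(my $sequence = $s_and_m // '') =~ s/\s+//g;

print "with /s only: ", (defined $s_only ? "matched" : "no match"), "\n";
print "with /ms:     $sequence\n";
```

Dropping the `^\s+` anchor altogether, as in the `qr{/translation="([^"]+)"}` pattern posted earlier in the thread, avoids the issue without needing `/m`.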
Re^4: Help to build a REGEXP (m//ms) by Anonymous Monk on Mar 12, 2014 at 00:21 UTC
Re^2: Help to build a REGEXP by Anonymous Monk on Mar 11, 2014 at 23:43 UTC
Also, once you get more than one line into $line7, you want non-greedy matching (`.*?`), as there are multiple "exon" strings. Also, you don't want to use m//g in scalar context. Also, perlrequick is a great quick reference :)
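Both points above can be seen in a small sketch. The excerpt and variable names are hypothetical; the patterns are simplified versions of the ones discussed in the thread.

```perl
use strict;
use warnings;

# Hypothetical excerpt containing two "exon" features after the sequence.
my $record = <<'END';
                     /translation="MLSFVD
                     TNEWGK"
     exon            2432..2501
                     /number=1
     exon            5144..5154
END

# Greedy: (.*) runs to the end and backs off to the LAST "exon".
my ($greedy) = $record =~ /translation="(.*)exon/s;

# Non-greedy: (.*?) stops before the FIRST "exon".
my ($lazy) = $record =~ /translation="(.*?)exon/s;

print "greedy capture contains the first exon line: ",
    ($greedy =~ /2432/ ? "yes" : "no"), "\n";
print "lazy capture contains the first exon line:   ",
    ($lazy =~ /2432/ ? "yes" : "no"), "\n";

# m//g in scalar context remembers where the last match ended, so the
# same test run twice on the same variable starts from different places.
my $first_end  = $record =~ /exon/g ? pos($record) : undef;
my $second_end = $record =~ /exon/g ? pos($record) : undef;
print "first match ended at $first_end, second at $second_end\n";
```

That remembered position is why a stray `/g` inside an `if` can make a later match on the same variable behave unexpectedly.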