need help with a regex

aquinom has asked for the wisdom of the Perl Monks concerning the following question:

Re: need help with a regex by kennethk (Abbot) on Oct 22, 2010 at 18:59 UTC
Rather than just the textual description you included above, it is generally much better to include a small sample of your actual input wrapped in code tags. Note that your post has been misformatted as a result. See Writeup Formatting Tips. Also, please provide sample output, as I have a great deal of trouble following your spec. See How do I post a question effectively?

I note that your posted code has several basic syntactic issues. Are you learning Perl from a book, or are you attempting to modify someone else's script? Do you have any background in programming in general? These issues include using a capitalized AND (vs. `and`; Perl is case sensitive), modifiers on your regular expressions that are inappropriate for what you are trying to do, and a wholly incorrect block structure. Telling us where you are will let us better guide your development as a programmer and point you to more useful resources.

I am loath to leave this node without a concrete bit of suggested code, but I am at a loss for how to even modify the posted code to "work".
Re^2: need help with a regex by aquinom (Sexton) on Oct 22, 2010 at 20:10 UTC
Sorry about the messy code I wrote. I know it wasn't actually legit; I was just in a hurry and trying to get the basic idea across. Here's a snip from what I'm parsing:

    >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M
    MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT
    QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS
    TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW
    ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL
    RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ
    RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS
    SDRKGGSYSQAASSDSAQGSDMSLTACKV

and the output should look like:

    Hydrophobic stretch found in: P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
    AVVAAVMW
    The match was at position: 325

    Hydrophobic stretch found in: A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
    VAVLMLCLAVIFLC
    The match was at position: 170
    LLALVAIFF
    The match was at position: 493
    IWICWFAALAA
    The match was at position: 705
    LALALAFA
    The match was at position: 970

    Hydrophobic region(s) found in 2 sequences out of 15 sequences
Re^3: need help with a regex by kennethk (Abbot) on Oct 22, 2010 at 20:57 UTC
Like I said, where are the `<code>` tags around your input and output? Embedding text in HTML is notorious for changing formatting. A regular expression I write will very likely fail because the character sequence displayed on the screen will differ from what you have in your file. I also note your "desired output" is significantly different from what you specified in the original post. For example, the word "Hydrophobic" appears nowhere in the OP and the word "contains" appears nowhere in the new spec.

"Sorry about the messy code I wrote. I know it wasn't actually legit"

Writing pseudocode is considered good practice when you don't know a language. That means explaining clearly what you want an algorithm to do, not just posting gibberish from the target language.

"I was just in a hurry and trying to get the basic idea across"

Which you did not do, nor have you done effectively yet. Perhaps the more verbose How To Ask Questions The Smart Way may provide clear guidance on how to effectively construct questions on internet forums.

You still did not answer my questions on your own experience level. I will assume you are an extreme novice with access to a working script crafted by another. I can give you aid on this particular problem, but if you expect to get anywhere in the long run, you will need to learn some very basic coding concepts you apparently lack.

In examining your desired output, I note that several of your character sequences do not appear in your text block, e.g. "VAVLMLCLAVIFLC", "LLALVAIFF", ... I note that "AVVAAVMW" is cited at "position: 325". This makes me suspect that the original file you are parsing does not contain the white space you are posting, or that the input is modified before filtering. I have modified your originally posted code to do something like what you request, though the numbers are wrong.
```perl
#!/usr/bin/perl
use strict;
use warnings;

local $/;    # Slurp
my $content = <DATA>;
my ($header) = $content =~ /^(>.*?)$/m;

while ($content =~ /^[\w]+?([VMFWLCA]{8,})[\w]+?$/mg) {
    my $sequence = $1;
    print $header, "contains $sequence at position ",
        pos($content) - length($sequence), "\n";
}

__DATA__
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT
QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS
TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW
ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL
RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ
RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS
SDRKGGSYSQAASSDSAQGSDMSLTACKV
```

outputs

`>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;Mcontains AVVAAVMW at position 420`

I leave modifying it to get what you expect as an exercise for you. You will likely want to read the documentation at perlsyn, perlre, perlretut, pos and length.
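For the exercise above, one possible direction is to strip the whitespace from a record before matching, so that a reported position refers to the residue sequence rather than to an offset in the slurped file. What follows is only a minimal sketch under stated assumptions, not the OP's script: the sub name `find_stretches`, the 1-based position convention, and the toy record are all made up for illustration, and the thread does not say how positions are meant to be counted.

```perl
use strict;
use warnings;

# Hypothetical helper: takes one FASTA-style record (header line plus
# sequence lines) and returns the header followed by [stretch, position]
# pairs. Positions are 1-based offsets into the sequence with all
# whitespace removed (an assumption, see the lead-in).
sub find_stretches {
    my ($record) = @_;
    my ($header, @lines) = split /\n/, $record;
    $header =~ s/^>//;                  # drop the FASTA marker
    my $sequence = join '', @lines;
    $sequence =~ s/\s+//g;              # remove any remaining whitespace

    my @hits;
    while ($sequence =~ /([VMFWLCA]{8,})/g) {
        # $-[1] holds the 0-based offset where capture group 1 starts
        push @hits, [ $1, $-[1] + 1 ];
    }
    return ($header, @hits);
}

# Toy record, not real data
my $record = ">toy | example\nGGGGAVVAAVMW\nRRKSLLALVALFFGG\n";
my ($header, @hits) = find_stretches($record);
if (@hits) {
    print "Hydrophobic stretch found in: $header\n";
    print "$_->[0]\nThe match was at position: $_->[1]\n" for @hits;
}
```

Using `$-[1]` reads the start of the captured group directly. By contrast, `pos($content) - length($sequence)` counts back from the end of the whole match, which in the posted pattern extends to the end of the line and is measured from the start of the file, header included.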
Re^4: need help with a regex by jwkrahn (Abbot) on Oct 22, 2010 at 21:33 UTC
Re^5: need help with a regex by kennethk (Abbot) on Oct 22, 2010 at 21:36 UTC
Re^4: need help with a regex by aquinom (Sexton) on Oct 22, 2010 at 21:49 UTC
Re^5: need help with a regex by kennethk (Abbot) on Oct 22, 2010 at 22:06 UTC
Re^4: need help with a regex by aquinom (Sexton) on Oct 22, 2010 at 21:17 UTC
Re^5: need help with a regex by kennethk (Abbot) on Oct 22, 2010 at 21:28 UTC
Re: need help with a regex by halfcountplus (Hermit) on Oct 22, 2010 at 19:12 UTC
I think you will get much better help if you are more specific and include an actual example of what you want to parse. Since you have admitted your regexp doesn't work, it is hard to deduce what you are trying to do. E.g., `[VMFWLCA]{,8}` is invalid (there must be something before the comma), but presuming you mean "at least once, and a maximum of 8 times" (which would be `{1,8}`), then $1 could be any of these:

VVVVV
FMFMWWWL
C

Is that what you are trying to match?
Re^2: need help with a regex by aquinom (Sexton) on Oct 22, 2010 at 20:34 UTC
`{,8}` is perfectly valid; it means a maximum of 8 times. But I meant to write `{8,}` anyway, which means at least 8 times.
Re^3: need help with a regex by jwkrahn (Abbot) on Oct 22, 2010 at 21:03 UTC
"`{,8}` is perfectly valid it means a maximum of 8 times"

`{,8}` is a perfectly valid string which in a regular expression matches the character '{' followed by the character ',' followed by the character '8' followed by the character '}'. Perhaps you are thinking of the quantifier `{0,8}`?
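The difference between the bounded forms can be checked with a quick experiment. This is a small sketch with made-up toy inputs. It sticks to `{0,8}`, `{1,8}` and `{8,}` on purpose, because the handling of `{,8}` depends on the Perl release: it is literal text on the versions current when this thread was written, while Perl 5.34 and later accept an empty lower bound as a quantifier.

```perl
use strict;
use warnings;

my $run   = 'VVVVVVVVVV';    # ten residues from the class (toy input)
my $short = 'VVV';           # three residues (toy input)

my ($at_most)  = $run   =~ /^([VMFWLCA]{0,8})/;    # zero to eight
my ($at_least) = $run   =~ /^([VMFWLCA]{8,})/;     # eight or more
my ($one_to_8) = $short =~ /^([VMFWLCA]{1,8})/;    # one to eight

# {8,} needs at least eight residues, so it cannot match the short input
my $short_ok = $short =~ /^[VMFWLCA]{8,}/ ? 1 : 0;

printf "{0,8} took %d, {8,} took %d, {1,8} took %d, {8,} on short input: %s\n",
    length($at_most), length($at_least), length($one_to_8),
    $short_ok ? 'match' : 'no match';
```

For finding stretches of eight or more hydrophobic residues, `{8,}` is the form that fits the stated goal; `{0,8}` would also succeed on an empty string.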
Re^4: need help with a regex by aquinom (Sexton) on Oct 22, 2010 at 21:28 UTC
Re^5: need help with a regex by johngg (Canon) on Oct 22, 2010 at 22:08 UTC
Re: need help with a regex by umasuresh (Hermit) on Oct 22, 2010 at 19:04 UTC
This has been addressed in the past: Parse Fasta Format.

UPDATE: ~~I can't find the original post!!!!~~ Here is the original post: http://www.perlmonks.com/?node_id=308170