in reply to Re: need help with a regex
in thread need help with a regex

Sorry about the messy code I wrote I know it wasn't actually legit I was just in a hurry and trying to get the basic idea across. Here's a snip from what I'm parsing:

>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;M

MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT
QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS
TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW
ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL
RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ
RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS
SDRKGGSYSQAASSDSAQGSDMSLTACKV

and the output should look like:
Hydrophobic stretch found in: P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
AVVAAVMW
The match was at position: 325
Hydrophobic stretch found in:
A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
VAVLMLCLAVIFLC
The match was at potistion: 170
LLALVAIFF
The match was at potistion: 493
IWICWFAALAA
The match was at potistion: 705
LALALAFA
The match was at potistion: 970
Hydrophobic region(s) found in 2 sequences out of 15 sequences

Replies are listed 'Best First'.
Re^3: need help with a regex
by kennethk (Abbot) on Oct 22, 2010 at 20:57 UTC
    Like I said, where are the <code> tags around your input and output? Embedding text in HTML is notorious for changing formatting. A regular expression I write will very likely fail because the character sequence displayed on the screen will differ from what you have in your file. I also note your "desired output" is significantly different from what you specified in the original post. For example, the word "Hydrophobic" appears nowhere the OP and the word "contains" appears nowhere in the new spec.

    Sorry about the messy code I wrote I know it wasn't actually legit
    Writing pseudocode is considered good practice when you don't know a language. That means explain clearly what you want an algorithm to do, not just posting gibberish from the target language.

    I was just in a hurry and trying to get the basic idea across
    Which you did not do, nor have you done effectively yet. Perhaps the more verbose How To Ask Questions The Smart Way may provide clear guidance on how to effective construct questions on internet forums.

    You still did not answer my questions on your own experience level. I will assume you are an extreme novice with access to a working script crafted by another. I can give you aid on this particular problem, but if you expect to get anywhere in the long run, you will need to learn some very basic coding concepts you apparently lack.

    In examining your desired output, I note that several of your character sequences do not appear in your text block, e.g. "VAVLMLCLAVIFLC", "LLALVAIFF", ... I note that "AVVAAVMW" is cited at "position: 325". This makes me suspect that the orginal file you are parsing does not contain the white space you are posting or modifies the input before filtering.

    I have modified your originally posted code to do something like what you request, though the numbers are wrong.

    #!/usr/bin/perl use strict; use warnings; local $/; # Slurp my $content = <DATA>; my ($header) = $content =~ /^(>.*?)$/m; while ($content =~ /^[\w]+?([VMFWLCA]{8,})[\w]+?$/mg) { my $sequence = $1; print $header, "contains $sequence at position ", pos($content) - +length($sequence), "\n"; } __DATA__ >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; + Synonyms=HLAA;M MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS SDRKGGSYSQAASSDSAQGSDMSLTACKV

    outputs

    >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;Mcontains AVVAAVMW at position 420

    I leave modifying it to get what you expect as an exercise for you. You will likely want to read the documentation at perlsyn, perlre, perlretut, pos and length.

      while ($content =~ /^[\w]+?([VMFWLCA]{8,})[\w]+?$/mg) { my $sequence = $1; print $header, "contains $sequence at position ", pos($content) - +length($sequence), "\n"; }

      What did you want "pos($content) - length($sequence)" to give you?    The whole pattern starts at 371 and ends at 428 while the contents of $1 start at 416 and end at 424 so 420 is somewhere in the middle of $1.    Have a look at the @- and @+ arrays for the start and end positions of matches.

        The logic was naively copied from the OP. To say it was buggy does not surprise me, though I probably should have tried to divine intent and test appropriately.
      Heh that's actually pretty similar to what I had originally coded and contains the same problem, the problem is 2 fold: for some reason /g isn't resetting the counter for each new line it goes through looking for the regex so I'm getting position numbers in the 10,000s, and 2: if I simply delete the header after the first print statement, the code isn't iterating over each header line and priting the next one that contains the sequence match/s in the next line. I tried doing a while loop over the text file looking at each header and storing it then doing a while loop over the text looking for the sequence pattern and storing it. I was thinking about then saying if the sequence was found (defined $1) print the header and the sequence but I can't figure out how to associate the two (a more complete text file to use from:)
      >P31946 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWH +AB; MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQK +TERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDN +KQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTL +NEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN >P62258 | Homo sapiens (Human). | NCBI_TaxID=9606; | 255 | Name=YWH +AE; MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKE +ENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGN +DRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDT +LSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ >Q04917 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWH +AH; Synonyms=YWHA1; MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKT +MADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVAS +GEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAEL +DTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA +-A; Synonyms=HLAA; MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRME +PRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQ +DAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDA +PKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQ +EQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKSSDRKGGSYSQ +AASSDSAQGSDMSLTACKV >Q156A1 | Homo sapiens (Human). | NCBI_TaxID=9606; | 80 | Name=ATXN +8; MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ +QQQQQQQQQQ >Q9UQB9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 309 | Name=AUR +KC; Synonyms=AIE2, AIK3, ARK3, STK13; MSSPRAVVQLGKAQPAGEELATANQTAQQPSSPAMRRLTVDDFEIGRPLGKGKFGNVYLARLKESHFIVA +LKVLFKSQIEKEGLEHQLRREIEIQAHLQHPNILRLYNYFHDARRVYLILEYAPRGELYKELQKSEKLD +EQRTATIIEELADALTYCHDKKVIHRDIKPENLLLGFRGEVKIADFGWSVHTPSLRRKTMCGTLDYLPP +EMIEGRTYDEKVDLWCIGVLCYELLVGYPPFESASHSETYRRILKVDVRFPLSMPLGARDLISRLLRYQ +PLERLPLAQILKHPWVQAHSRRVLPPCAQMAS >O75366 | Homo sapiens (Human). | NCBI_TaxID=9606; | 819 | Name=AVI +L; MPLTSAFRAVDNDPGIIVWRIEKMELALVPVSAHGNFYEGDCYVILSTRRVASLLSQDIHFWIGKDSSQD +EQSCAAIYTTQLDDYLGGSPVQHREVQYHESDTFRGYFKQGIIYKQGGVASGMKHVETNTYDVKRLLHV +KGKRNIRATEVEMSWDSFNRGDVFLLDLGKVIIQWNGPESNSGERLKAMLLAKDIRDRERGGRAEIGVI +EGDKEAASPELMKVLQDTLGRRSIIKPTVPDEIIDQKQKSTIMLYHISDSAGQLAVTEVATRPLVQDLL +NHDDCYILDQSGTKIYVWKGKGATKAEKQAAMSKALGFIKMKSYPSSTNVETVNDGAESAMFKQLFQKW +SVKDQTMGLGKTFSIGKIAKVFQDKFDVTLLHTKPEVAAQERMVDDGNGKVEVWRIENLELVPVEYQWY +GFFYGGDCYLVLYTYEVNGKPHHILYIWQGRHASQDELAASAYQAVEVDRQFDGAAVQVRVRMGTEPRH +FMAIFKGKLVIFEGGTSRKGNAEPDPPVRLFQIHGNDKSNTKAVEVPAFASSLNSNDVFLLRTQAEHYL +WYGKGSSGDERAMAKELASLLCDGSENTVAEGQEPAEFWDLLGGKTPYANDKRLQQEILDVQSRLFECS +NKTGQFVVTEITDFTQDDLNPTDVMLLDTWDQVFLWIGAEANATEKESALATAQQYLHTHPSGRDPDTP +ILIIKQGFEPPIFTGWFLAWDPNIWSAGKTYEQLKEELGDAAAIMRITADMKNATLSLNSNDSEPKYYP +IAVLLKNQNQELPEDVNPAKKENYLSEQDFVSVFGITRGQFAALPGWKQLQMKKEKGLF >Q9UPA5 | Homo sapiens (Human). | NCBI_TaxID=9606; | 3926 | Name=BS +N; Synonyms=KIAA0434, ZNF231; MGNEVSLEGGAGDGPLPPGGAGPGPGPGPGPGAGKPPSAPAGGGQLPAAGAARSTAVPPVPGPGPGPGPG +PGPGSTSRRLDPKEPLGNQRAASPTPKQASATTPGHESPRETRAQGPAGQEADGPRRTLQVDSRTQRSG +RSPSVSPDRGSTPTSPYSVPQIAPLPSSTLCPICKTSDLTSTPSQPNFNTCTQCHNKVCNQCGFNPNPH +LTQVKEWLCLNCQMQRALGMDMTTAPRSKSQQQLHSPALSPAHSPAKQPLGKPDQERSRGPGGPQPGSR +QAETARATSVPGPAQAAAPPEVGRVSPQPPQPTKPSTAEPRPPAGEAPAKSATAVPAGLGATEQTQEGL +TGKLFGLGASLLTQASTLMSVQPEADTQGQPAPSKGTPKIVFNDASKEAGPKPLGSGPGPGPAPGAKTE +PGARMGPGSGPGALPKTGGTTSPKHGRAEHQAASKAAAKPKTMPKERAICPLCQAELNVGSKSPANYNT +CTTCRLQVCNLCGFNPTPHLVEKTEWLCLNCQTKRLLEGSLGEPTPLPPPTSQQPPVGAPHRASGTSPL +KQKGPQGLGQPSGPLPAKASPLSTKASPLPSKASPQAKPLRASEPSKTPSSVQEKKTRVPTKAEPMPKP +PPETTPTPATPKVKSGVRRAEPATPVVKAVPEAPKGGEAEDLVGKPYSQDASRSPQSLSDTGYSSDGIS +SSQSEITGVVQQEVEQLDSAGVTGPHPPSPSEIHKVGSSMRPLLQAQGLAPSERSKPLSSGTGEEQKQR +PHSLSITPEAFDSDEELEDILEEDEDSAEWRRRREQQDTAESSDDFGSQLRHDYVEDSSEGGLSPLPPQ +PPARAAELTDEDFMRRQILEMSAEEDNLEEDDTATSGRGLAKHGTQKGGPRPRPEPSQEPAALPKRRLP +HNATTGYEELLPEGGSAEATDGSGTLQGGLRRFKTIELNSTGSYGHELDLGQGPDPSLDREPELEMESL +TGSPEDRSRGEHSSTLPASTPSYTSGTSPTSLSSLEEDSDSSPSRRQRLEEAKQQRKARHRSHGPLLPT +IEDSSEEEELREEEELLREQEKMREVEQQRIRSTARKTRRDKEELRAQRRRERSKTPPSNLSPIEDASP +TEELRQAAEMEELHRSSCSEYSPSPSLDSEAEALDGGPSRLYKSGSEYNLPTFMSLYSPTETPSGSSTT +PSSGRPLKSAEEAYEEMMRKAELLQRQQGQAAGARGPHGGPSQPTGPRGLGSFEYQDTTDREYGQAAQP +AAEGTPASLGAAVYEEILQTSQSIVRMRQASSRDLAFAEDKKKEKQFLNAESAYMDPMKQNGGPLTPGT +SPTQLAAPVSFSTPTSSDSSGGRVIPDVRVTQHFAKETQDPLKLHSSPASPSSASKEIGMPFSQGPGTP +ATTAVAPCPAGLPRGYMTPASPAGSERSPSPSSTAHSYGHSPTTANYGSQTEDLPQAPSGLAAAGRAAR +EKPLSASDGEGGTPQPSRAYSYFASSSPPLSPSSPSESPTFSPGKMGPRATAEFSTQTPSPAPASDMPR +SPGAPTPSPMVAQGTQTPHRPSTPRLVWQESSQEAPFMVITLASDASSQTRMVHASASTSPLCSPTETQ +PTTHGYSQTTPPSVSQLPPEPPGPPGFPRVPSAGADGPLALYGWGALPAENISLCRISSVPGTSRVEPG +PRTPGTAVVDLRTAVKPTPIILTDQGMDLTSLAVEARKYGLALDPIPGRQSTAVQPLVINLNAQEHTFL +ATATTVSITMASSVFMAQQKQPVVYGDPYQSRLDFGQGGGSPVCLAQVKQVEQAVQTAPYRSGPRGRPR +EAKFARYNLPNQVAPLARRDVLITQMGTAQSIGLKPGPVPEPGAEPHRATPAELRSHALPGARKPHTVV +VQMGEGTAGTVTTLLPEEPAGALDLTGMRPESQLACCDMVYKLPFGSSCTGTFHPAPSVPEKSMADAAP +PGQSSSPFYGPRDPEPPEPPTYRAQGVVGPGPHEEQRPYPQGLPGRLYSSMSDTNLAEAGLNYHAQRIG +QLFQGPGRDSAMDLSSLKHSYSLGFADGRYLGQGLQYGSVTDLRHPTDLLAHPLPMRRYSSVSNIYSDH +RYGPRGDAVGFQEASLAQYSATTAREISRMCAALNSMDQYGGRHGSGGGGPDLVQYQPQHGPGLSAPQS +LVPLRPGLLGNPTFPEGHPSPGNLAQYGPAAGQGTAVRQLLPSTATVRAADGMIYSTINTPIAATLPIT +TQPASVLRPMVRGGMYRPYASGGITAVPLTSLTRVPMIAPRVPLGPTGLYRYPAPSRFPIASSVPPAEG +PVYLGKPAAAKAPGAGGPSRPEMPVGAAREEPLPTTTPAAIKEAAGAPAPAPLAGQKPPADAAPGGGSG +ALSRPGFEKEEASQEERQRKQQEQLLQLERERVELEKLRQLRLQEELERERVELQRHREEEQLLVQREL +QELQTIKHHVLQQQQEERQAQFALQREQLAQQRLQLEQIQQLQQQLQQQLEEQKQRQKAPFPAACEAPG +RGPPLAAAELAQNGQYWPPLTHAAFIAMAGPEGLGQPREPVLHRGLPSSASDMSLQTEEQWEASRSGIK +KRHSMPRLRDACELESGTEPCVVRRIADSSVQTDDEDGESRYLLSRRRRARRSADCSVQTDDEDSAEWE +QPVRRRRSRLPRHSDSGSDSKHDATASSSSAAATVRAMSSVGIQTISDCSVQTEPDQLPRVSPAIHITA +ATDPKVEIVRYISAPEKTGRGESLACQTEPDGQAQGVAGPQLVGPTAISPYLPGIQIVTPGPLGRFEKK +KPDPLEIGYQAHLPPESLSQLVSRQPPKSPQVLYSPVSPLSPHRLLDTSFASSERLNKAHVSPQKHFTA +DSALRQQTLPRPMKTLQRSLSDPKPLSPTAEESAKERFSLYQHQGGLGSQVSALPPNSLVRKVKRTLPS +PPPEEAHLPLAGQASPQLYAASLLQRGLTGPTTVPATKASLLRELDRDLRLVEHESTKLRKKQAELDEE +EKEIDAKLKYLELGITQRKESLAKDRGGRDYPPLRGLGEHRDYLSDSELNQLRLQGCTTPAGQFVDFPA +TAAAPATPSGPTAFQQPRFQPPAPQYSAGSGGPTQNGFPAHQAPTYPGPSTYPAPAFPPGASYPAEPGL +PNQQAFRPTGHYAGQTPMPTTQSTLFPVPADSRAPLQKPRQTSLADLEQKVPTNYEVIASPVVPMSSAP +SETSYSGPAVSSGYEQGKVPEVPRAGDRGSVSQSPAPTYPSDSHYTSLEQNVPRNYVMIDDISELTKDS +TSTAPDSQRLEPLGPGSSGRPGKEPGEPGVLDGPTLPCCYARGEEESEEDSYDPRGKGGHLRSMESNGR +PASTHYYGDSDYRHGARVEKYGPGPMGPKHPSKSLAPAAISSKRSKHRKQGMEQKISKFSPIEEAKDVE +SDLASYPPPAVSSSLVSRGRKFQDEITYGLKKNVYEQQKYYGMSSRDAVEDDRIYGGSSRSRAPSAYSG +EKLSSHDFSGWGKGYEREREAVERLQKAGPKPSSLSMAHSRVRPPMRSQASEEESPVSPLGRPRPAGGP +LPPGGDTCPQFCSSHSMPDVQEHVKDGPRAHAYKREEGYILDDSHCVVSDSEAYHLGQEETDWFDKPRD +ARSDRFRHHGGHAVSSSSQKRGPARHSYHDYDEPPEEGLWPHDEGGPGRHASAKEHRHGDHGRHSGRHT +GEEPGRRAAKPHARDLGRHEARPHSQPSSAPAMPKKGQPGYPSSAEYSQPSRASSAYHHASDSKKGSRQ +AHSGPAALQSKAEPQAQPQLQGRQAAPGPQQSQSPSSRQIPSGAASRQPQTQQQQQGLGLQPPQQALTQ +ARLQQQSQPTTRGSAPAASQPAGKPQPGPSTATGPQPAGPPRAEQTNGSKGTAKAPQQGRAPQAQPAPG +PGPAGVKAGARPGGTPGAPAGQPGADGESVFSKILPGGAAEQAGKLTEAVSAFGKKFSSFW >Q9NSI6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 2320 | Name=BR +WD1; Synonyms=C21orf107, WDR9; MAEPSSARRPVPLIESELYFLIARYLSAGPCRRAAQVLVQELEQYQLLPKRLDWEGNEHNRSYEELVLSN +KHVAPDHLLQICQRIGPMLDKEIPPSISRVTSLLGAGRQSLLRTAKDCRHTVWKGSAFAALHRGRPPEM +PVNYGSPPNLVEIHRGKQLTGCSTFSTAFPGTMYQHIKMHRRILGHLSAVYCVAFDRTGHRIFTGSDDC +LVKIWSTHNGRLLSTLRGHSAEISDMAVNYENTMIAAGSCDKIIRVWCLRTCAPVAVLQGHTGSITSLQ +FSPMAKGSQRYMVSTGADGTVCFWQWDLESLKFSPRPLKFTEKPRPGVQMLCSSFSVGGMFLATGSTDH +VIRMYFLGFEAPEKIAELESHTDKVDSIQFCNNGDRFLSGSRDGTARIWRFEQLEWRSILLDMATRISG +DLSSEEERFMKPKVTMIAWNQNDSIVVTAVNDHVLKVWNSYTGQLLHNLMGHADEVFVLETHPFDSRIM +LSAGHDGSIFIWDITKGTKMKHYFNMIEGQGHGAVFDCKFSQDGQHFACTDSHGHLLIFGFGCSKPYEK +IPDQMFFHTDYRPLIRDSNNYVLDEQTQQAPHLMPPPFLVDVDGNPHPTKYQRLVPGRENSADEHLIPQ +LGYVATSDGEVIEQIISLQTNDNDERSPESSILDGMIRQLQQQQDQRMGADQDTIPRGLSNGEETPRRG +FRRLSLDIQSPPNIGLRRSGQVEGVRQMHQNAPRSQIATERDLQAWKRRVVVPEVPLGIFRKLEDFRLE +KGEEERNLYIIGRKRKTLQLSHKSDSVVLVSQSRQRTCRRKYPNYGRRNRSWRELSSGNESSSSVRHET +SCDQSEGSGSSEEDEWRSDRKSESYSESSSDSSSRYSDWTADAGINLQPPLRTSCRRRITRFCSSSEDE +ISTENLSPPKRRRKRKKENKPKKENLRRMTPAELANMEHLYEFHPPVWITDTTLRKSPFVPQMGDEVIY +FRQGHEAYIEAVRRNNIYELNPNKEPWRKMDLRDQELVKIVGIRYEVGPPTLCCLKLAFIDPATGKLMD +KSFSIRYHDMPDVIDFLVLRQFYDEARQRNWQSCDRFRSIIDDAWWFGTVLSQEPYQPQYPDSHFQCYI +VRWDNTEIEKLSPWDMEPIPDNVDPPEELGASISVTTDELEKLLYKPQAGEWGQKSRDEECDRIISGID +QLLNLDIAAAFAGPVDLCTYPKYCTVVAYPTDLYTIRMRLVNRFYRRLSALVWEVRYIEHNARTFNEPE +SVIARSAKKITDQLLKFIKNQHCTNISELSNTSENDEQNAEDLDDSDLPKTSSGRRRVHDGKKSIRATN +YVESNWKKQCKELVNLIFQCEDSEPFRQPVDLVEYPDYRDIIDTPMDFGTVRETLDAGNYDSPLEFCKD +IRLIFSNAKAYTPNKRSKIYSMTLRLSALFEEKMKKISSDFKIGQKFNEKLRRSQRFKQRQNCKGDSQP +NKSIRNLKPKRLKSQTKIIPELVGSPTQSTSSRTAYLGTHKTSAGISSGVTSGDSSDSAESSERRKRNR +PITNGSTLSESEVEDSLATSLSSSASSSSEESKESSRARESSSRSGLSRSSNLRVTRTRAAQRKTGPVS +LANGCGRKATRKRVYLSDSDNNSLETGEILKARAGNNRKVLRKCAAVAANKIKLMSDVEENSSSESVCS +GRKLPHRNASAVARKKLLHNSEDEQSLKSEIEEEELKDENQPLPVSSSHTAQSNVDESENRDSESESDL +RVARKNWHANGYKSHTPAPSKTKFLKIESSEEDSKSHDSDHACNRTAGPSTSVQKLKAESISEEADSEP +GRSGGRKYNTFHKNASFFKKTKILSDSEDSESEEQDREDGKCHKMEMNPISGNLNCDPIAMSQCSSDHG +CETDLDSDDDKIEKPNNFMKDSASQDNGLSRKISRKRVCSSDSDSSLQVVKKSSKARTGLLRITRRCAA +TAANKIKLMSDVEDVSLENVHTRSKNGRKKPLHLACTTAKKKLSDCEGSVHCEVPSEQYACEGKPPDPD +SEGSTKVLSQALNGDSDSEDMLNSEHKHRHTNIHKIDAPSKRKSSSVTSSGEDSKSHIPGSETDRTFSS +ESTLAQKATAENNFEVELNYGLRRWNGRRLRTYGKAPFSKTKVIHDSQETAEKEVKRKRSHPELENVKI +SETTGNSKFRPDTSSKSSDLGSVTESDIDCTDNTKTKRRKTKGKAKVVRKEFVPRDREPNTKVRTCMHN +QKDAVQMPSETLKAKMVPEKVPRRCATVAANKIKIMSNLKETISGPENVWIRKSSRKLPHRNASAAAKK +KLLNVYKEDDTTINSESEKELEDINRKMLFLRGFRSWKENAQ >Q96KE9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 485 | Name=BTB +D6; Synonyms=BDPL; MAAELYAPASAAAADLANSNAGAAVGRKAGPRSPPSAPAPAPPPPAPAPPTLGNNHQESPGWRCCRPTLR +ERNALMFNNELMADVHFVVGPPGATRTVPAHKYVLAVGSSVFYAMFYGDLAEVKSEIHIPDVEPAAFLI +LLKYMYSDEIDLEADTVLATLYAAKKYIVPALAKACVNFLETSLEAKNACVLLSQSRLFEEPELTQRCW +EVIDAQAEMALRSEGFCEIDRQTLEIIVTREALNTKEAVVFEAVLNWAEAECKRQGLPITPRNKRHVLG +RALYLVRIPTMTLEEFANGAAQSDILTLEETHSIFLWYTATNKPRLDFPLTKRKGLAPQRCHRFQSSAY +RSNQWRYRGRCDSIQFAVDRRVFIAGLGLYGSSSGKAEYSVKIELKRLGVVLAQNLTKFMSDGSSNTFP +VWFEHPVQVEQDTFYTASAVLDGSELSYFGQEGMTEVQCGKVAFQFQCSSDSTNGTGVQGGQIPELIFY +A >P0C7T9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 278 | Name=BZW +1L1; MENSERNKLAMLTGVLLANGTLNASILNSLYNENLVKEGVSAAFAVKLFKSWINEKDINAVAASLRKVSM +DNRLMELFPANKQSVEHFTKYFTEAGLKELSEYVRNQQTIGARKELQKELQEQMSRGDPFKDIILYVKE +EMKKNNIPEPVVIGIVWSSVMSTVEWNKKEELVAEQAIKHLKQYSPLLAAFTTQGQSELTLLLKIQEYC +YDNIHFMKAFQKIVVLFYKAEVLSEEPILKWYKDAHVAKGKSVFLEQMKKFVEWLKNAEEESESEAEEG +D >Q8IYA2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1237 | Name=CC +DC144C; MVSWGGEKRGGAEGSPKPAVYATRKTGSVRSQEDQWYLGYPGDQWSSGFSYSWWKNSVGSESKHGEGALD +QPQHDVRLEDLGELHRAARSGDVPGVEHVLVPGDTGVDKRDRKKSIQQLVPEYKEKQTPESLPQNNNPD +WHPTNLTLSDETCQRSKNLKVDDKCPSVSPSMPENQSATKELGQMNLTEREKMDTGVKTSQEPEMAKDC +DREDIPIYPVLPHVQKSEEMRIEQGKLEWKNQLKLVINELKQRFGEIYEKYKIPACPEEEPLLDNSTRG +TDVKDIPFNLTNNIPGCEEEDASEISVSVVFETFPEQKEPSLKNIIHSYYHPYSGSQEHVCQSSSKLHL +HENKLDCDNDNKPGIGHIFSTDKNFHNDASTKKARNPEVVTVEMKEDQEFDLQMTKNMNQNSDSGSTNN +YKSLKPKLENLSSLPPDSDRTSEVYLHEELQQDMQKFKNEVNTLEEEFLALKKENVQLHKEVEEEMEKH +RSNSTELSGTLTDGTTVGNDDDGLNQQIPRKENGEHDRLALKQENEEKRNADMLYNKDSEQLRIKEEEC +GKVVETKQQLKWNLRRLVKELRTVVQERNDAQKQLSEEQDARILQDQILTSKQKELEMAQKKRNPEISH +RHQKEKDLFHENCMLQEEIALLRLEIDTIKNQNKQKEKKYFEDIEVVKEKNDNLQKIIKRNEETLTETI +LQYSGQLNNLTAENKMLNSELENGKENQERLEIEMESYRCRLAAAVHDCDQSQTARDLKLDFQRTRQEW +VRLHDKMKVDMSGLQAKNEILSEKLSNAESKINSLQIQLHNTRDALGRESLILERVQRDLSQTQCQKKE +TEQMYQSKLKKYIAKQESVEERLSQLQSENMLLRQQLDDVHKKANSQEKTISTIQDQFHSAAKNLQAES +EKQILSLQEKNKELMDEYNHLKERMDQCEKEKAGRKIDLTEAQETVPSRCLHLDAENEVLQLQQTLFSM +KAIQKQCETLQKNKKQLKQEVVNLKSYMERNMLERGEAEWHKLLIEERARKEIEEKLNEAILTLQKQAA +VSHEQLAQLREDNTTSIKTQMELTVIDLESEISRIKTSQADFNKTKLERYKELYLEEVKVRESLSNELS +RTNEMIAEVSTQLTVEKEQTRSRSLFTAYATRPVLESPCVGNLNDSEGLNRKHIPRKKRSALKDMESYL +LKMQQKLQNDLTAEVAGSSQTGLHRIPQCSSFSSSSLHLLLCSICQPFFLILQLLLNMNLDPI >A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DI +SP2; Synonyms=DISPB, KIAA1742; MDGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSS +GPPPTTSTLQPVGPSSPLAPAHFTYPRALQEYQGGSSLPGLGDRAALCSHGSSLSPSPAPSQRDGTWKP +PAVQHHVVSVRQERAFQMPKSYSQLIAEWPVAVLMLCLAVIFLCTLAGLLGARLPDFSKPLLGFEPRDT +DIGSKLVVWRALQALTGPRKLLFLSPDLELNSSSSHNTLRPAPRGSAQESAVRPRRMVEPLEDRRQENF +FCGPPEKSYAKLVFMSTSSGSLWNLHAIHSMCRMEQDQIRSHTSFGALCQRTAANQCCPSWSLGNYLAV +LSNRSSCLDTTQADAARTLALLRTCALYYHSGALVPSCLGPGQNKSPRCAQVPTKCSQSSAIYQLLHFL +LDRDFLSPQTTDYQVPSLKYSLLFLPTPKGASLMDIYLDRLATPWGLADNYTSVTGMDLGLKQELLRHF +LVQDTVYPLLALVAIFFGMALYLRSLFLTLMVLLGVLGSLLVAFFLYQVAFRMAYFPFVNLAALLLLSS +VCANHTLIFFDLWRLSKSQLPSGGLAQRVGRTMHHFGYLLLVSGLTTSAAFYASYLSRLPAVRCLALFM +GTAVLVHLALTLVWLPASAVLHERYLARGCARRARGRWEGSAPRRLLLALHRRLRGLRRAAAGTSRLLF +QRLLPCGVIKFRYIWICWFAALAAGGAYIAGVSPRLRLPTLPPPGGQVFRPSHPFERFDAEYRQLFLFE +QLPQGEGGHMPVVLVWGVLPVDTGDPLDPRSNSSLVRDPAFSASGPEAQRWLLALCHRARNQSFFDTLQ +EGWPTLCFVETLQRWMESPSCARLGPDLCCGHSDFPWAPQFFLHCLKMMALEQGPDGTQDLGLRFDAHG +SLAALVLQFQTNFRNSPDYNQTQLFYNEVSHWLAAELGMAPPGLRRGWFTSRLELYSLQHSLSTEPAVV +LGLALALAFATLLLGTWNVPLSLFSVAAVAGTVLLTVGLLVLLEWQLNTAEALFLSASVGLSVDFTVNY +CISYHLCPHPDRLSRVAFSLRQTSCATAVGAAALFAAGVLMLPATVLLYRKLGIILMMVKCVSCGFASF +FFQSLCCFFGPEKNCGQILWPCAHLPWDAGTGDPGGEKAGRPRPGSVGGMPGSCSEQYELQPLARRRSP +SFDTSTATSKLSHRPSVLSEDLQLHDGPCCSRPPPAPASPRELLLDHQAVFSQCPALQTSSPYKQAGPS +PKTRARQDSQGEEAEPLPASPEAPAHSPKAKAADPPDGFCSSASTLEGLSVSDETCLSTSEPSARVPDS +VGVSPDDLDDTGQPVLERGQLNGKRDTLWLALRETVYDPSLPASHHSSLSWKGRGGPGDGSPVVLPNSQ +PDLPDVWLRRPSTHTSGYSS >Q96HU8 | Homo sapiens (Human). | NCBI_TaxID=9606; | 199 | Name=DIR +AS2; MPEQSNDYRVAVFGAGGVGKSSLVLRFVKGTFRESYIPTVEDTYRQVISCDKSICTLQITDTTGSHQFPA +MQRLSISKGHAFILVYSITSRQSLEELKPIYEQICEIKGDVESIPIMLVGNKCDESPSREVQSSEAEAL +ARTWKCAFMETSAKLNHNVKELFQELLNLEKRRTVSLQIDGKKSKQQKRKEKLKGKCVIM >Q8N4W6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 341 | Name=DNA +JC22; MAKGLLVTYALWAVGGPAGLHHLYLGRDSHALLWMLTLGGGGLGWLWEFWKLPSFVAQANRAQGQRQSPR +GVTPPLSPIRFAAQVIVGIYFGLVALISLSSMVNFYIVALPLAVGLGVLLVAAVGNQTSDFKNTLGSAF +LTSPIFYGRPIAILPISVAASITAQRHRRYKALVASEPLSVRLYRLGLAYLAFTGPLAYSALCNTAATL +SYVAETFGSFLNWFSFFPLLGRLMEFVLLLPYRIWRLLMGETGFNSSCFQEWAKLYEFVHSFQDEKRQL +AYQVLGLSEGATNEEIHRSYQELVKVWHPDHNLDQTEEAQRHFLEIQAAYEVLSQPRKPWGSRR
        I think we're getting somewhere, though wrapping the data in a <readmore> tag would be helpful for later reviewers of the thread. Rather than slurping the file, you can use the newlines to split alternating records. A base level while on the file handle, followed by an if on whether it is a header or not, and then a while(m//g) to iterate over the matches. Since you throw away the word characters you were matching before and after your sequence capture, you can drop those so pos and length will yield your expected result (see jwkrahn's insight).

        outputs:

        >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA +-A; Synonyms=HLAA; contains AVVAAVMW at position 324 >A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DI +SP2; Synonyms=DISPB, KIAA1742; contains VAVLMLCLAVIFLC at position 169 contains LLALVAIFF at position 492 contains IWICWFAALAA at position 704 contains LALALAFA at position 969

        Note you'd omitted 'I' from your original character set.

      those sequences you don't see because they are in the file I'm parsing not what I posted otherwise I'd have to post the entire 3000 line file. my experience is pretty novice I've been working with perl for about 3 months but I have to write this code myself I don't have any other person's script that I'm editing to make it. It's a bit more complicated than what you gave me as a response in your code because If you match the first in the header, store it. If then on the next line iteration you have a match to the residues, print out the header and the match. If there's additional matches in the sequence, like the output shows, only print the name of the sequence once, but print out the other hits, like I've shown in the output. I can easily pull out all the sequences from the file or all the headers but I want to store the header only if the sequence(s) are present on the subsequent line then print them all at once. my issue is not with working with the regular expressions but more on deciding which kind of loop would be best suited for the task. Thank you and sorry about my posting etiquette.

        Everyone learns by making mistakes. Demonstrate you can learn from those, and you will be well-regarded on this forum.

        If you post input and desired output (good), make sure they correspond. In this case, only including the first section of output would have been appropriate, so the two match up.

        I believe this works closer to your spec; it clears the contents of $header after the first print, so it will only appear once.

        #!/usr/bin/perl use strict; use warnings; local $/; # Slurp my $content = <DATA>; my ($header) = $content =~ /^(>.*?\n)/m; while ($content =~ /^[\w]*?([VMFWLCA]{8,})[\w]*?$/mg) { my $sequence = $1; print $header, "contains $sequence at position ", pos($content) - +length($sequence), "\n"; $header = ""; } __DATA__ >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; + Synonyms=HLAA;M MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDT QFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGS TIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKW ETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATL RCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQ RYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKS SDRKGGSYSQAASSDSAQGSDMSLTACKVVAVLMLCLAVIFLC

        outputs:

        >P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; + Synonyms=HLAA;M contains AVVAAVMW at position 420 contains VVAVLMLCLAV at position 461

        Note I've changed you + to * so that your regular expression can also match entries at the start and ends of lines, not just in the middle. I assume this was an oversight on your part; sorry if this assumption is incorrect.