Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks and Happy New Year to all of you!
I wanted to ask for your wisdom regarding regular expression matters.
Specifically, I have a bunch of strings and their substrings, and I want to write a program that can look and find where in the big string, the substring exists.
My problem is that, for example, I can have:
$bigstring = "MYRKSTLAMLIALLTSAASAHAQTDISTIEARLNALEKRLQEAENRAQTAENRAGA +AEKKVQQLTAQQQKNQNSTQEVAQRTARLEKKADDKSGFEFHGYARSGVIMNDSGASTKSGAYITPAGE +TGGAIGRLGNQADTYVEMNLEHKQTLDNGATTRFKVMVADGQTSYNDWTASTSDLNVRQAFVELGNLPT +FAGPFKGSTLWAGKRFDRDNFDIHWIDSDVVFLAGTGGGIYDVKWNDGLRSNFSLYGRNFGDIDDSSNS +VQNYILTMNHFAGPLQMMVSGLRAKDNDERKDSNGNLAKGDAANTGVHALLGLHNDSFYGLRDGSSKTA +LLYGHGLGAEVKGIGSDGALRPGADTWRIASYGTTPLSENWSVAPAMLAQRSKDRYADGDSYQWATFNL +RLIQAINQNFALAYEGSYQYMDLKPEGYNDRQAVNGSFYKLTFAPTFKVGSIGDFFSRPEIRFYTSWMD +WSKKLNNYASDDALGSDGFNSGGEWSFGVQMETWF"; $smallstring ="SGFEFHGYARSGVIMNDSGASTKSGAYITPAGETGGAIGRLGNQADTYVEMNLEH +KQTLDNGATTRFKVMVADGQTSYNDWTASTSDLNVRQAFVELGNLPTFAGPFKGSTLWAGKRFDRDNFD +IHWIDSDVVFLAGTGGGIYDVKWNDGLRSNFSLYGRNFGDIDDSSNSVQNYILTMNHFAGPLQMMVSGL +RAKDNDERKDSNGNLAKGDAANTGVHALLGLHNDSFYGLRDGSSKTALLYGHGLGAEVKGIGSDGALRP +GADTWRIASYGTTPLSENWSVAPAMLAQRSKDRYADGDSYQWATFNLRLIQAINQNFALAYEGSYQYMD +LKPEGYNDRQAVNGSFYKLTFAPTFKVGSIGDFFSRPEIRFYTSWMDWSKKLNNYASDDALGSDGFNSG +GEWSFGVQMETWF";

where I could easily find where the substring lies (because it is included as a whole in the big string, BUT, there are also cases like:
$bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTVNNE +LGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVKHNG +GYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRLGSG +TQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFA +FDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGA +LTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVV +DNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGR +PTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADKVAL +NGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRFNTA +GYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLAASG +DHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDGQGT +AFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTSTALN +LNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSMLSGN +INVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAGNMK +LNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNKDTL +DIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHISYN +NFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMAT +YTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAE +VGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTA +RAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAI +NANIRYSF"; $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSG +SADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYF +DVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDS +GMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSR +MLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE";

where NOT all, but more than 90% of the small string is included in the bi string... But, in this case, only by using the index function, I wouldn't be able to identify that my small string DOES exist in the big string, but with some letters of it missing...
How could I proceed in this case?
Thanks in advance!

Replies are listed 'Best First'.
Re: Tips on how to perform this regex query
by BrowserUk (Patriarch) on Jan 11, 2014 at 10:44 UTC
    where NOT all, but more than 90% of the small string is included in the bi string

    Hm. Your sample does not demonstrate that. The best match is 12.01% (at offset 575).

    The simplest (and fastest pure Perl) method is to use bitwise-xor masking at each of the possible offsets and the count the number of nulls (zero-bytes) in the result. Here is code that demonstrates that:

    #! perl -slw use strict; my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV +NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK +HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL +GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP +LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG +AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG +IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA +SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK +VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF +NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA +ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG +QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST +ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML +SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG +NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK +DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI +SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV +MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA +GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS +LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD +DAINANIRYSF"; my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN +GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG +AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW +NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK +DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE"; my $lenBig = length $bigstring; my $lenSmall = length $smallstring; for my $o ( 0 .. ( $lenBig - $lenSmall + 1 ) ) { my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring; my $matched = $masked =~ tr[\0][]; if( ( $matched / $lenSmall ) > 0.10 ) { $masked =~ tr[\1-\255][ ]; $masked =~ tr[\0][*]; printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1 +00, $o; print substr $bigstring, $o, $lenSmall; print $smallstring; print $masked; } }

    The output:

    C:\test>1070240.pl 10.06% match at offset 50 LAGTVNNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYI +ASVKHNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPV +FYRLGSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGD +SGSPLFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFN +SSTGAGALTQGTTTYAMHGQQGNDLNAGKNL GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * * + * * * * * * + * * * * ** * * * + * ** * * * * + * * 10.06% match at offset 143 VDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRLGSGTQYIKDSNGQLTKMGGA +YSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFAFDTVQNKWVLVGVLTAG +NGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGND +LNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGDN +LHKIGEGTLTVQGTGINEGGLKVGDGKVVLN GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * + ** * * * ** + * * * * * * * + * * * * * *** + * * * 10.71% match at offset 211 GAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFAFDTVQNKWVLVGVLTA +GNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGN +DLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGD +NLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVNPDTV +SWGYRGGTLDVNGNSLTFHQLKAADYGAVLA GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * **** * * + * * ** * + ** * * * * * * * + * * * ** * + * * * * 11.04% match at offset 227 YQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPL +DFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQ +INLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTG +INEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSL +TFHQLKAADYGAVLANNVDKRATITLDYALR GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * + * * * * * * * + * * * * * * * * + * * * * * * * * +* * * * * 11.69% match at offset 276 GVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHG +QQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNG +VKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVN +PDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADKVALNGWSESGKGTAG +NLYKYNNPYTNTTDYFILKQSTYGYFPTDQS GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * * * * * * * + * * * ** * * + * * * * * * ** + * * * * * * * * + * 11.04% match at offset 310 PVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSL +TFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVV +LNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVL +ANNVDKRATITLDYALRADKVALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNA +TWEFVGHSQGDAQKLVADRFNTAGYLFHGQL GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * + * * * * * * * +** * * * *** * * * * * * + * * ** +* * * * 10.06% match at offset 536 DKVALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADR +FNTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKL +AASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKND +GQGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTS +TALNLNKGANALASQSFVSDGPVNISDATLS GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * + * * * * * * * + * * * * * + * * * * * * * +* * * * ** * 12.01% match at offset 575 YGYFPTDQSSNATWEFVGHSQGDAQKLVADRFNTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADIS +GTFTQENGRLTLQGHPVIHAYNTQSVADKLAASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGL +GRNATLNTTIQADNSSVTLGDSRVFIDKNDGQGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNIN +DIFNGGIQANNSTVNISSDSAVLGNSTLTSTALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEV +SHTLLPVYDYAGSWNLKGDDARLNVGPYSML GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * * * * * + * * * ** * * * + * * * * * * + * * * * * ** * + * * * * * 10.06% match at offset 813 TALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML +SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG +NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK +DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI +SYNNFITEVNNLNKRMGDLRDINGEAGTWVR GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * + * * * * *** + * ** * + * * * * * * * * * * * + * * * 11.69% match at offset 970 TTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNKDTLDIPLVSAPEATADNLF +RASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDL +RDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTK +SWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQ +AELVWGRLQGQTFNWNDSGMDVSMRRNSVNP GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * * * * + * * * * * * * * + * * * * * * * * * + * * * * * * * + * * * * 10.06% match at offset 1054 ILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLN +GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG +AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW +NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK +DSRMLYGVGLNARFGDNTRLGLEVERSAFGK GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE * * * * * ** + * * ** * * * + * * * ** + ** * * * * * * * + * * *

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      It depends on whether you count deletion and insertion as possible operations, too. But if you do not, xor is a good trick.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        It depends on whether you count deletion and insertion as possible operations, too.

        Have you found an application where 'edit distance' is actually meaningful?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      my $lenSmall = length $smallstring; for my $o ( 0 .. $lenBig - 1 ) { my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring; my $matched = $masked =~ tr[\0][]; if( ( $matched / $lenSmall ) > 0.90 ) { $masked =~ tr[\1-\255][ ]; $masked =~ tr[\0][*]; printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1 +00, $o; print substr $bigstring, $o, $lenSmall; print $smallstring; print $masked; } }
      98.05% match at offset 1071 YQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSF GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE ************************* ***************************************** +********************************************************************* +********************************************************************* +********************************************************************* +*****************************

        c:\test\anonymonks.pl Global symbol "$lenBig" requires explicit package name at C:\test\1070 +240.pl line 9. Execution of C:\test\1070240.pl aborted due to compilation errors.
        Correct for that and compare:
        my $bigstring="MNRIYSLRYSA..."; my $smallstring="GTMARNDGQGKAA..."; my $lenBig = length $bigstring; my $lenSmall = length $smallstring; for my $o ( 0 .. ( $lenBig - $lenSmall + 1 ) ) { my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring; my $matched = $masked =~ tr[\0][]; if( ( $matched / $lenSmall ) > 0.095 ) { $masked =~ tr[\1-\255][ ]; $masked =~ tr[\0][*]; printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1 +00, $o; print substr $bigstring, $o, $lenSmall; print $smallstring; print $masked; } } ]; $otherstring = q[ my $bigstring="MNRIYSLRYSA..."; my $smallstring="GTMARNDGQGKAA..."; my $lenBig = length $bigstring; my $lenSmall = length $smallstring; for my $o ( 0 .. $lenBig - 1 ) { my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring; my $matched = $masked =~ tr[\0][]; if( ( $matched / $lenSmall ) > 0.90 ) { $masked =~ tr[\1-\255][ ]; $masked =~ tr[\0][*]; printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1 +00, $o; print substr $bigstring, $o, $lenSmall; print $smallstring; print $masked; } } ]; print compare( $firststring, $otherstring ); __END__ 99.4% plagerised

        Original thought is a rare commodity.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        corrected and improved
        #! perl -slw use strict; my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV +NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK +HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL +GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP +LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG +AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG +IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA +SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK +VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF +NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA +ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG +QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST +ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML +SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG +NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK +DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI +SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV +MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA +GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS +LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD +DAINANIRYSF"; my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN +GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG +AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW +NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK +DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE"; my $lenBig = length $bigstring; my $lenSmall = length $smallstring; my $threshold = 0.9; for my $o ( 0 .. $lenBig - $threshold*$lenSmall ) { my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring; my $matched = $masked =~ tr[\0][]; if( ( $matched / $lenSmall ) > $threshold ) { $masked =~ tr[\1-\255][ ]; $masked =~ tr[\0][*]; printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1 +00, $o; print substr $bigstring, $o, $lenSmall; print $smallstring; print $masked; } }
        98.05% match at offset 1071 YQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSF GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ +MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD +LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL +VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN +TRLGLEVERSAFGKYNTDDAINANIRYSFLE ************************* ***************************************** +********************************************************************* +********************************************************************* +********************************************************************* +*****************************
Re: Tips on how to perform this regex query
by oiskuu (Hermit) on Jan 11, 2014 at 11:23 UTC

    Technically, your search strings are not regular expressions, nor is it sensible to approach this as a regex problem.

    This looks like a Protein sequence alignment problem. Use the tools of the trade. Or do you plan to research new algorithms?

Re: Tips on how to perform this regex query
by roboticus (Chancellor) on Jan 11, 2014 at 10:23 UTC

      Sorry roboticus, but what do you think running one of those 'edit distance' algorithms on the two strings will actually tell the OP?

      To save time, I'll tell you. Ta-dah:

      1073!

      Which means what?

      My best guess:

      The best interpretation is that is a fuzzy measure of the difference in length of the two strings (actually 1069).

      Which mean that a simple: print length( $bigstring ) - length( $smallstring ); would be more accurate and about a million times faster.

      It doesn't tell him how different they are, nor where the best match occurs,

      Not even if the shorter actually appears in the larger to any meaningful measure of the term.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Tips on how to perform this regex query
by Laurent_R (Canon) on Jan 11, 2014 at 11:08 UTC

    Jarkko Hietaniemi's String::Approx module is certainly your best bet to start with. In addition to the links provided by Roboticus, the following modules might provide some ideas of algorithms that could be used: Text::Metaphone and Text::Soundex.

    Chapter 9 ("Strings") of Algorithms with Perl (Orwant, Hietaniemi, Macdonald) covers in some details some of the techniques that may be used. Otherwise, Simon Cozen's Advanced Perl Programming and John Orwant's Computer Science & Perl Programming both have a chapter briefly covering related techniques. (All three books are published by O'Reilly.)

Re: Tips on how to perform this regex query (100% matched)
by BrowserUk (Patriarch) on Jan 11, 2014 at 23:47 UTC

    Hm. I suspect that your "more than 90% of the small string is included in the bi[g] string." can not be arrived at by 'wildcarding' some number of characters in your small string -- as is implied by your "with some letters of it missing..." suggestion.

    It can only be achieved by breaking your small string into chunks and locating each (or most) of those chunks within the bigstring disregarding order.

    Thus, I arrived at the following code which finds the entirety of the small string (100% of it) in the bigstring, in six discrete chunks out of order:

    #! perl -slw use strict; my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV +NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK +HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL +GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP +LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG +AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG +IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA +SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK +VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF +NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA +ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG +QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST +ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML +SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG +NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK +DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI +SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV +MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA +GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS +LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD +DAINANIRYSF"; my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN +GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG +AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW +NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK +DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE"; my $lenBig = length $bigstring; my $lenSmall = length $smallstring; my $o = 0; WHILE: while( $o < $lenSmall ) { my $p; for my $l ( reverse 1 .. ( $lenSmall - $o ) ) { my $ss = substr( $smallstring, $o, $l ); if( $p = 1+index( $bigstring, $ss ) ) { print "Found '$ss'($o:$l) at $p"; $o += $l; next WHILE; } } ++$o; }

    The output from which is:

    C:\test>1070240.pl Found 'GT'(0:2) at 53 Found 'MA'(2:2) at 1160 Found 'RNDGQGKAAATFMHISYNNFITEV'(4:24) at 1076 Found 'DNL'(28:3) at 419 Found 'NKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTD +TDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGY +RYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAG +LHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINAN +IRYSF'(31:275) at 1103 Found 'LE'(306:2) at 322

    You'll probably want to exclude several of the smaller chunks because they are found at offsets that overlap where the two larger chunks are located. And that may be where you get your 90% from.

    Removing/ignoring small chunks that overlap larger chunks is relatively easy; until you start to think about what to do if a smaller chunk only partially overlaps a larger. Then things start to get complicated.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Tips on how to perform this regex query
by LanX (Saint) on Jan 11, 2014 at 13:46 UTC
    Sorry - as others already pointed out - your question is imprecise.

    What exactly does "missing" or "90%" mean?

    Have single letters been

    • deleted,
    • replaced or
    • inserted

    to one of the strings?

    Does order matter?

    What's the alphabet and are all characters equally interchangeable?

    How is the match x% calculated?

    Please show a concise example for all cases and give us estimates about the normal size of the data in question.

    And I'll promise you an equally good answer...

    HTH! :)

    Cheers Rolf

    ( addicted to the Perl Programming Language)