in reply to Tips on how to perform this regex query

Hm. I suspect that your "more than 90% of the small string is included in the bi[g] string." can not be arrived at by 'wildcarding' some number of characters in your small string -- as is implied by your "with some letters of it missing..." suggestion.

It can only be achieved by breaking your small string into chunks and locating each (or most) of those chunks within the bigstring disregarding order.

Thus, I arrived at the following code which finds the entirety of the small string (100% of it) in the bigstring, in six discrete chunks out of order:

#! perl -slw use strict; my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV +NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK +HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL +GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP +LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG +AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG +IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA +SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK +VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF +NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA +ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG +QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST +ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML +SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG +NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK +DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI +SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV +MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA +GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS +LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD +DAINANIRYSF"; my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN +GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG +AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW +NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK +DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE"; my $lenBig = length $bigstring; my $lenSmall = length $smallstring; my $o = 0; WHILE: while( $o < $lenSmall ) { my $p; for my $l ( reverse 1 .. ( $lenSmall - $o ) ) { my $ss = substr( $smallstring, $o, $l ); if( $p = 1+index( $bigstring, $ss ) ) { print "Found '$ss'($o:$l) at $p"; $o += $l; next WHILE; } } ++$o; }

The output from which is:

C:\test>1070240.pl Found 'GT'(0:2) at 53 Found 'MA'(2:2) at 1160 Found 'RNDGQGKAAATFMHISYNNFITEV'(4:24) at 1076 Found 'DNL'(28:3) at 419 Found 'NKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTD +TDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGY +RYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAG +LHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINAN +IRYSF'(31:275) at 1103 Found 'LE'(306:2) at 322

You'll probably want to exclude several of the smaller chunks because they are found at offsets that overlap where the two larger chunks are located. And that may be where you get your 90% from.

Removing/ignoring small chunks that overlap larger chunks is relatively easy; until you start to think about what to do if a smaller chunk only partially overlaps a larger. Then things start to get complicated.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.