Re: Tips on how to perform this regex query
by BrowserUk (Patriarch) on Jan 11, 2014 at 10:44 UTC
|
where NOT all, but more than 90% of the small string is included in the bi string
Hm. Your sample does not demonstrate that. The best match is 12.01% (at offset 575).
The simplest (and fastest pure Perl) method is to use bitwise-xor masking at each of the possible offsets and the count the number of nulls (zero-bytes) in the result. Here is code that demonstrates that:
#! perl -slw
use strict;
my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV
+NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK
+HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL
+GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP
+LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG
+AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG
+IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA
+SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK
+VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF
+NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA
+ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG
+QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST
+ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML
+SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG
+NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK
+DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI
+SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV
+MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA
+GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS
+LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD
+DAINANIRYSF";
my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN
+GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG
+AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW
+NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK
+DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE";
my $lenBig = length $bigstring;
my $lenSmall = length $smallstring;
for my $o ( 0 .. ( $lenBig - $lenSmall + 1 ) ) {
my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring;
my $matched = $masked =~ tr[\0][];
if( ( $matched / $lenSmall ) > 0.10 ) {
$masked =~ tr[\1-\255][ ];
$masked =~ tr[\0][*];
printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1
+00, $o;
print substr $bigstring, $o, $lenSmall;
print $smallstring;
print $masked;
}
}
The output: C:\test>1070240.pl
10.06% match at offset 50
LAGTVNNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYI
+ASVKHNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPV
+FYRLGSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGD
+SGSPLFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFN
+SSTGAGALTQGTTTYAMHGQQGNDLNAGKNL
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * * *
+ * * * * * *
+ * * * * ** * * *
+ * ** * * * *
+ * *
10.06% match at offset 143
VDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRLGSGTQYIKDSNGQLTKMGGA
+YSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFAFDTVQNKWVLVGVLTAG
+NGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGND
+LNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGDN
+LHKIGEGTLTVQGTGINEGGLKVGDGKVVLN
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * *
+ ** * * * **
+ * * * * * * *
+ * * * * * ***
+ * * *
10.71% match at offset 211
GAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFAFDTVQNKWVLVGVLTA
+GNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGN
+DLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGD
+NLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVNPDTV
+SWGYRGGTLDVNGNSLTFHQLKAADYGAVLA
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * **** * *
+ * * ** *
+ ** * * * * * * *
+ * * * ** *
+ * * * *
11.04% match at offset 227
YQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSPLFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPL
+DFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQ
+INLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTG
+INEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSL
+TFHQLKAADYGAVLANNVDKRATITLDYALR
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * *
+ * * * * * * *
+ * * * * * * * *
+ * * * * * * * *
+* * * * *
11.69% match at offset 276
GVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHG
+QQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNG
+VKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVN
+PDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADKVALNGWSESGKGTAG
+NLYKYNNPYTNTTDYFILKQSTYGYFPTDQS
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * * * * * * * *
+ * * * ** * *
+ * * * * * * **
+ * * * * * * * *
+ *
11.04% match at offset 310
PVTFRTSEGGALEWSFNSSTGAGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSL
+TFRDNYTVTTSNGSTWTGAGIVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVV
+LNQQADNKGQVQAFSSVNIASGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVL
+ANNVDKRATITLDYALRADKVALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNA
+TWEFVGHSQGDAQKLVADRFNTAGYLFHGQL
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * *
+ * * * * * * *
+** * * * *** * * * * * *
+ * * **
+* * * *
10.06% match at offset 536
DKVALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADR
+FNTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKL
+AASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKND
+GQGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTS
+TALNLNKGANALASQSFVSDGPVNISDATLS
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * *
+ * * * * * * *
+ * * * * *
+ * * * * * * *
+* * * * ** *
12.01% match at offset 575
YGYFPTDQSSNATWEFVGHSQGDAQKLVADRFNTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADIS
+GTFTQENGRLTLQGHPVIHAYNTQSVADKLAASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGL
+GRNATLNTTIQADNSSVTLGDSRVFIDKNDGQGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNIN
+DIFNGGIQANNSTVNISSDSAVLGNSTLTSTALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEV
+SHTLLPVYDYAGSWNLKGDDARLNVGPYSML
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * * * * * *
+ * * * ** * * *
+ * * * * * *
+ * * * * * ** *
+ * * * * *
10.06% match at offset 813
TALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML
+SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG
+NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK
+DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI
+SYNNFITEVNNLNKRMGDLRDINGEAGTWVR
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * *
+ * * * * ***
+ * ** *
+ * * * * * * * * * * *
+ * * *
11.69% match at offset 970
TTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNKDTLDIPLVSAPEATADNLF
+RASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDL
+RDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTK
+SWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQ
+AELVWGRLQGQTFNWNDSGMDVSMRRNSVNP
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * * * *
+ * * * * * * * *
+ * * * * * * * * *
+ * * * * * * *
+ * * * *
10.06% match at offset 1054
ILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLN
+GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG
+AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW
+NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK
+DSRMLYGVGLNARFGDNTRLGLEVERSAFGK
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
* * * * * **
+ * * ** * * *
+ * * * **
+ ** * * * * * * *
+ * * *
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
|
It depends on whether you count deletion and insertion as possible operations, too. But if you do not, xor is a good trick.
| [reply] |
|
|
| [reply] |
|
|
|
|
|
|
|
my $lenSmall = length $smallstring;
for my $o ( 0 .. $lenBig - 1 ) {
my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring;
my $matched = $masked =~ tr[\0][];
if( ( $matched / $lenSmall ) > 0.90 ) {
$masked =~ tr[\1-\255][ ];
$masked =~ tr[\0][*];
printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1
+00, $o;
print substr $bigstring, $o, $lenSmall;
print $smallstring;
print $masked;
}
}
98.05% match at offset 1071
YQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSF
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
************************* *****************************************
+*********************************************************************
+*********************************************************************
+*********************************************************************
+*****************************
| [reply] [d/l] [select] |
|
|
c:\test\anonymonks.pl
Global symbol "$lenBig" requires explicit package name at C:\test\1070
+240.pl line 9.
Execution of C:\test\1070240.pl aborted due to compilation errors.
Correct for that and compare:my $bigstring="MNRIYSLRYSA...";
my $smallstring="GTMARNDGQGKAA...";
my $lenBig = length $bigstring;
my $lenSmall = length $smallstring;
for my $o ( 0 .. ( $lenBig - $lenSmall + 1 ) ) {
my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring;
my $matched = $masked =~ tr[\0][];
if( ( $matched / $lenSmall ) > 0.095 ) {
$masked =~ tr[\1-\255][ ];
$masked =~ tr[\0][*];
printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1
+00, $o;
print substr $bigstring, $o, $lenSmall;
print $smallstring;
print $masked;
}
}
];
$otherstring = q[
my $bigstring="MNRIYSLRYSA...";
my $smallstring="GTMARNDGQGKAA...";
my $lenBig = length $bigstring;
my $lenSmall = length $smallstring;
for my $o ( 0 .. $lenBig - 1 ) {
my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring;
my $matched = $masked =~ tr[\0][];
if( ( $matched / $lenSmall ) > 0.90 ) {
$masked =~ tr[\1-\255][ ];
$masked =~ tr[\0][*];
printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1
+00, $o;
print substr $bigstring, $o, $lenSmall;
print $smallstring;
print $masked;
}
}
];
print compare( $firststring, $otherstring );
__END__
99.4% plagerised
Original thought is a rare commodity.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
|
#! perl -slw
use strict;
my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV
+NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK
+HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL
+GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP
+LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG
+AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG
+IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA
+SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK
+VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF
+NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA
+ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG
+QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST
+ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML
+SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG
+NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK
+DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI
+SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV
+MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA
+GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS
+LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD
+DAINANIRYSF";
my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN
+GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG
+AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW
+NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK
+DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE";
my $lenBig = length $bigstring;
my $lenSmall = length $smallstring;
my $threshold = 0.9;
for my $o ( 0 .. $lenBig - $threshold*$lenSmall ) {
my $masked = substr( $bigstring, $o, $lenSmall ) ^ $smallstring;
my $matched = $masked =~ tr[\0][];
if( ( $matched / $lenSmall ) > $threshold ) {
$masked =~ tr[\1-\255][ ];
$masked =~ tr[\0][*];
printf "%.2f%% match at offset %u\n", $matched / $lenSmall * 1
+00, $o;
print substr $bigstring, $o, $lenSmall;
print $smallstring;
print $masked;
}
}
98.05% match at offset 1071
YQVARNDGQGKAAATFMHISYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSF
GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQ
+MGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYD
+LNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPL
+VGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDN
+TRLGLEVERSAFGKYNTDDAINANIRYSFLE
************************* *****************************************
+*********************************************************************
+*********************************************************************
+*********************************************************************
+*****************************
| [reply] [d/l] [select] |
|
|
Re: Tips on how to perform this regex query
by oiskuu (Hermit) on Jan 11, 2014 at 11:23 UTC
|
Technically, your search strings are not regular expressions, nor is it sensible to approach this as a regex problem.
This looks like a Protein sequence alignment problem. Use the tools of the trade. Or do you plan to research new algorithms?
| [reply] |
Re: Tips on how to perform this regex query
by roboticus (Chancellor) on Jan 11, 2014 at 10:23 UTC
|
| [reply] |
|
|
Sorry roboticus, but what do you think running one of those 'edit distance' algorithms on the two strings will actually tell the OP?
To save time, I'll tell you. Ta-dah: 1073!
Which means what?
My best guess:
The best interpretation is that is a fuzzy measure of the difference in length of the two strings (actually 1069).
Which mean that a simple: print length( $bigstring ) - length( $smallstring ); would be more accurate and about a million times faster.
It doesn't tell him how different they are, nor where the best match occurs,
Not even if the shorter actually appears in the larger to any meaningful measure of the term.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: Tips on how to perform this regex query
by Laurent_R (Canon) on Jan 11, 2014 at 11:08 UTC
|
Jarkko Hietaniemi's String::Approx module is certainly your best bet to start with. In addition to the links provided by Roboticus, the following modules might provide some ideas of algorithms that could be used: Text::Metaphone and Text::Soundex.
Chapter 9 ("Strings") of Algorithms with Perl (Orwant, Hietaniemi, Macdonald) covers in some details some of the techniques that may be used. Otherwise, Simon Cozen's Advanced Perl Programming and John Orwant's Computer Science & Perl Programming both have a chapter briefly covering related techniques. (All three books are published by O'Reilly.)
| [reply] |
Re: Tips on how to perform this regex query (100% matched)
by BrowserUk (Patriarch) on Jan 11, 2014 at 23:47 UTC
|
Hm. I suspect that your "more than 90% of the small string is included in the bi[g] string." can not be arrived at by 'wildcarding' some number of characters in your small string -- as is implied by your "with some letters of it missing..." suggestion.
It can only be achieved by breaking your small string into chunks and locating each (or most) of those chunks within the bigstring disregarding order.
Thus, I arrived at the following code which finds the entirety of the small string (100% of it) in the bigstring, in six discrete chunks out of order:
#! perl -slw
use strict;
my $bigstring="MNRIYSLRYSAVARGFIAVSEFARKCVHKSVRRLCFPVLLLIPVLFSAGSLAGTV
+NNELGYQLFRDFAENKGMFRPGATNIAIYNKQGEFVGTLDKAAMPDFSAVDSEIGVATLINPQYIASVK
+HNGGYTNVSFGDGENRYNIVDRNNAPSLDFHAPRLDKLVTEVAPTAVTAQGAVAGAYLDKERYPVFYRL
+GSGTQYIKDSNGQLTKMGGAYSWLTGGTVGSLSSYQNGEMISTSSGLVFDYKLNGAMPIYGEAGDSGSP
+LFAFDTVQNKWVLVGVLTAGNGAGGRGNNWAVIPLDFIGQKFNEDNDAPVTFRTSEGGALEWSFNSSTG
+AGALTQGTTTYAMHGQQGNDLNAGKNLIFQGQNGQINLKDSVSQGAGSLTFRDNYTVTTSNGSTWTGAG
+IVVDNGVSVNWQVNGVKGDNLHKIGEGTLTVQGTGINEGGLKVGDGKVVLNQQADNKGQVQAFSSVNIA
+SGRPTVVLTDERQVNPDTVSWGYRGGTLDVNGNSLTFHQLKAADYGAVLANNVDKRATITLDYALRADK
+VALNGWSESGKGTAGNLYKYNNPYTNTTDYFILKQSTYGYFPTDQSSNATWEFVGHSQGDAQKLVADRF
+NTAGYLFHGQLKGNLNVDNRLPEGVTGALVMDGAADISGTFTQENGRLTLQGHPVIHAYNTQSVADKLA
+ASGDHSVLTQPTSFSQEDWENRSFTFDRLSLKNTDFGLGRNATLNTTIQADNSSVTLGDSRVFIDKNDG
+QGTAFTLEEGTSVATKDADKSVFNGTVNLDNQSVLNINDIFNGGIQANNSTVNISSDSAVLGNSTLTST
+ALNLNKGANALASQSFVSDGPVNISDATLSLNSRPDEVSHTLLPVYDYAGSWNLKGDDARLNVGPYSML
+SGNINVQDKGTVTLGGEGELSPDLTLQNQMLYSLFNGYRNIWSGSLNAPDATVSMTDTQWSMNGNSTAG
+NMKLNRTIVGFNGGTSPFTTLTTDNLDAVQSAFVMRTDLNKADKLVINKSATGHDNSIWVNFLKKPSNK
+DTLDIPLVSAPEATADNLFRASTRVVGFSDVTPILSVRKEDGKKEWVLDGYQVARNDGQGKAAATFMHI
+SYNNFITEVNNLNKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGV
+MATYTDTDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYA
+GAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWS
+LTARAGLHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTD
+DAINANIRYSF";
my $smallstring="GTMARNDGQGKAAATFMHISYNNFITEVDNLNKRMGDLRDINGEAGTWVRLLN
+GSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTDTDASADLYSGKTKSWGGGFYASGLFRSG
+AYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGYRYHLTDTTFVEPQAELVWGRLQGQTFNW
+NDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAGLHYEFDLTDSADVHLKDAAGEHQINGRK
+DSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINANIRYSFLE";
my $lenBig = length $bigstring;
my $lenSmall = length $smallstring;
my $o = 0;
WHILE: while( $o < $lenSmall ) {
my $p;
for my $l ( reverse 1 .. ( $lenSmall - $o ) ) {
my $ss = substr( $smallstring, $o, $l );
if( $p = 1+index( $bigstring, $ss ) ) {
print "Found '$ss'($o:$l) at $p";
$o += $l;
next WHILE;
}
}
++$o;
}
The output from which is: C:\test>1070240.pl
Found 'GT'(0:2) at 53
Found 'MA'(2:2) at 1160
Found 'RNDGQGKAAATFMHISYNNFITEV'(4:24) at 1076
Found 'DNL'(28:3) at 419
Found 'NKRMGDLRDINGEAGTWVRLLNGSGSADGGFTDHYTLLQMGADRKHELGSMDLFTGVMATYTD
+TDASADLYSGKTKSWGGGFYASGLFRSGAYFDVIAKYIHNENKYDLNFAGAGKQNFRSHSLYAGAEVGY
+RYHLTDTTFVEPQAELVWGRLQGQTFNWNDSGMDVSMRRNSVNPLVGRTGVVSGKTFSGKDWSLTARAG
+LHYEFDLTDSADVHLKDAAGEHQINGRKDSRMLYGVGLNARFGDNTRLGLEVERSAFGKYNTDDAINAN
+IRYSF'(31:275) at 1103
Found 'LE'(306:2) at 322
You'll probably want to exclude several of the smaller chunks because they are found at offsets that overlap where the two larger chunks are located. And that may be where you get your 90% from.
Removing/ignoring small chunks that overlap larger chunks is relatively easy; until you start to think about what to do if a smaller chunk only partially overlaps a larger. Then things start to get complicated.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
Re: Tips on how to perform this regex query
by LanX (Saint) on Jan 11, 2014 at 13:46 UTC
|
Sorry - as others already pointed out - your question is imprecise.
What exactly does "missing" or "90%" mean?
Have single letters been
- deleted,
- replaced or
- inserted
to one of the strings?
Does order matter?
What's the alphabet and are all characters equally interchangeable?
How is the match x% calculated?
Please show a concise example for all cases and give us estimates about the normal size of the data in question.
And I'll promise you an equally good answer...
HTH! :)
Cheers Rolf
( addicted to the Perl Programming Language)
| [reply] |