More efficient way for this pattern match?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have the following 2 strings:

$str1='MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTIAAA
+MYFAMATGFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLAGAN
+RN----TIATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVGTLS
+ENARNRAPEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSAKVG
+FGVILLQSRSVLERVATPTAAPT';

$str2='--OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMMMMM
+MMMMMMMOOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMIIIII
+II----MMMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMMMMI
+IIIIIIIIIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMMMMM
+MMMMMMIIIIIIIIIIIII----';
[download]

They are of the same length. What I need to do is, for each position in $str2, if in the respective position of $str1 there is a letter and not a '-', replace the '-' (in $str2) with the letter that there is on its right (or on its left if there is no right position). For example, the first 2 positions in $str2 are -, but, we see that in $str1 there are letters. Therefore, these first 2 positions should be replaced by 'O'. The same goes for the last positions in $str2, which are -, and should be replaced by 'I' (since there is no letter on their right side). There is also one position in $str2 (#246) which corresponds to a letter in $str1 (F), thus this should be changed also to 'O' (since 'O' is on position #247 of $str2.
I know I must use the split function, but I can't find the logic behind it...

Comment on More efficient way for this pattern match? Select or Download Code

Replies are listed 'Best First'.
Re: More efficient way for this pattern match? by GrandFather (Saint) on Apr 14, 2015 at 01:40 UTC
No, the split function is not a good way to do this trick. A regular expression is a better bet: #!/usr/bin/perl use strict; use warnings; use File::Copy; my $str1='MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTI +AAAMYFAMATGFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLA +GANRN----TIATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVG +TLSENARNRAPEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSA +KVGFGVILLQSRSVLERVATPTAAPT'; my $str2='--OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMM +MMMMMMMMMMOOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMII +IIIII----MMMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMM +MMIIIIIIIIIIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMM +MMMMMMMMMIIIIIIIIIIIII----'; while ($str2 =~ /(-+)/g) { my ($start, $end) = ($-[0], $+[0]); my $matchLen = $end - $start; next if substr($str1, $start, $matchLen) =~ /^-+$/; my $chIdx = $end == length($str2) ? $start - 1 : $end; substr ($str2, $start, $matchLen, substr($str2, $chIdx, 1) x $matc +hLen); } print $str2; [download] Prints: `OOOOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMM +OOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMIIIIIII----M +MMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMMMMIIIIIIII +IIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOOOOOOO--OOOMMMMMMMMMMMMMMMMMMMMMI +IIIIIIIIIIIIIIII` [download] Perl is the programming world's equivalent of English	[reply] [d/l] [select]
Re^2: More efficient way for this pattern match? by ikegami (Patriarch) on Apr 14, 2015 at 03:00 UTC
Modifying `$str2` is resetting `pos($str2)`, so you're doing a lot of unneeded work. Adding `pos($str2) = $end;` at the end of the loop addresses this issue.	[reply] [d/l] [select]
Re^2: More efficient way for this pattern match? by hdb (Monsignor) on Apr 14, 2015 at 12:08 UTC
Your script seems to make some additional assumptions that I cannot find in the original question. For example, for `my $str1='--M--CCA'; my $str2='-----OOO';` [download] your script prints `OOOOOOOO` while I would have thought it should be `--O--OOO`? UPDATE: In order to avoid any confusion, the strings above are NOT part of the original question but examples I constructed assuming they could occur. The purpose of this was to highlight a situation where the proposed script would deliver something that violates the original requirements. Not sure the example is really relevant.	[reply] [d/l] [select]
Re^3: More efficient way for this pattern match? by GrandFather (Saint) on Apr 14, 2015 at 21:08 UTC
Good catch. It's a bug. The line: `next if substr($str1, $start, $matchLen) =~ /^-+$/;` [download] should be more like (untested): `next if ! matchLen \|\| substr($str1, $start, $matchLen) !~ /[^-]/;` [download] Perl is the programming world's equivalent of English	[reply] [d/l] [select]
Re^2: More efficient way for this pattern match? by Anonymous Monk on Apr 14, 2015 at 07:14 UTC
Thanks so much!	[reply]
Re^2: More efficient way for this pattern match? by Anonymous Monk on Apr 14, 2015 at 06:46 UTC
No, the split function is not a good way to do this trick. A regular expression is a better bet: FWIW, split function , same as match operator, both take a regular expression ;)	[reply]
Re: More efficient way for this pattern match? by Anonymous Monk on Apr 14, 2015 at 14:44 UTC
#!/usr/bin/perl # http://perlmonks.org/?node_id=1123341 use strict; use warnings; my $str1='MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTI +AAAMYFAMATGFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLA +GANRN----TIATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVG +TLSENARNRAPEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSA +KVGFGVILLQSRSVLERVATPTAAPT'; my $str2='--OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMM +MMMMMMMMMMOOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMII +IIIII----MMMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMM +MMIIIIIIIIIIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMM +MMMMMMMMMIIIIIIIIIIIII----'; my ($last) = $str2 =~ /.(\w)/; # find letter for ending -'s tr/\0-/-=/, s/=(?=\W(\w?))/ $1 \|\| $last /ge for my $answer = $str1 =~ tr/A-Z/\0/r ^ $str2; print "$answer\n"; [download] Explanation provided on request :)	[reply] [d/l]
Re: More efficient way for this pattern match? by Anonymous Monk on Apr 14, 2015 at 19:41 UTC
See also this thread for a very similar, though not identical, problem. Also, Hello! to your perl teacher. #! /usr/bin/perl sub ano1 { my ($str1, $str2) = @_; my ($last) = $str2 =~ /.(\w)/; tr/\0-/-=/, s/=(?=\W(\w?))/ $1 \|\| $last /ge for my $answer = $str1 =~ tr/A-Z/\0/r ^ $str2; $answer; } sub ano2 { my ($x, $y) = @_; $y &= ($y =~ tr/-\0-\377/\0\377/r) \| ($x =~ tr/-\0-\377/\377\0/r); $y =~ s/(\0+)(.)/$2 x (1 + length $1)/eg; $y =~ s/(.)(\0+)$/$1 x (1 + length $2)/er; } chomp (our ($str1, $str2) = <DATA>); use Benchmark 'cmpthese'; cmpthese -5, { ano1 => q( ano1 $str1, $str2 ), ano2 => q( ano2 $str1, $str2 ), }; __DATA__ MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTIAAAMYFAMAT +GFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLAGANRN----T +IATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVGTLSENARNRA +PEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSAKVGFGVILLQ +SRSVLERVATPTAAPT --OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMM +OOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMIIIIIII----M +MMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMMMMIIIIIIII +IIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMMMMMMMMMMMI +IIIIIIIIIIII---- [download]	[reply] [d/l]
Re^2: More efficient way for this pattern match? by Anonymous Monk on Apr 15, 2015 at 12:07 UTC
Sigh, this is too crazy. I give up, and I'll go back to golf! This runs faster as a one-liner than as multiple lines. :( Also tr/A-Z/\0/r is faster than tr/-/\0/cr It makes no sense :( `sub ano1 { (pop ^ pop =~ tr/A-Z/\0/r) =~ tr/\0-/-=/r =~ s/(=+(\w))/$2 +x length $1/ger =~ s/((.)=+)$/$2 x length $1/er }` [download]	[reply] [d/l]