Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have the following 2 strings:
$str1='MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTIAAA +MYFAMATGFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLAGAN +RN----TIATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVGTLS +ENARNRAPEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSAKVG +FGVILLQSRSVLERVATPTAAPT'; $str2='--OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMMMMM +MMMMMMMOOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMIIIII +II----MMMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMMMMI +IIIIIIIIIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMMMMM +MMMMMMIIIIIIIIIIIII----';

They are of the same length. What I need to do is, for each position in $str2, if in the respective position of $str1 there is a letter and not a '-', replace the '-' (in $str2) with the letter that there is on its right (or on its left if there is no right position). For example, the first 2 positions in $str2 are -, but, we see that in $str1 there are letters. Therefore, these first 2 positions should be replaced by 'O'. The same goes for the last positions in $str2, which are -, and should be replaced by 'I' (since there is no letter on their right side). There is also one position in $str2 (#246) which corresponds to a letter in $str1 (F), thus this should be changed also to 'O' (since 'O' is on position #247 of $str2.
I know I must use the split function, but I can't find the logic behind it...

Replies are listed 'Best First'.
Re: More efficient way for this pattern match?
by GrandFather (Saint) on Apr 14, 2015 at 01:40 UTC

    No, the split function is not a good way to do this trick. A regular expression is a better bet:

    #!/usr/bin/perl use strict; use warnings; use File::Copy; my $str1='MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTI +AAAMYFAMATGFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLA +GANRN----TIATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVG +TLSENARNRAPEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSA +KVGFGVILLQSRSVLERVATPTAAPT'; my $str2='--OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMM +MMMMMMMMMMOOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMII +IIIII----MMMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMM +MMIIIIIIIIIIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMM +MMMMMMMMMIIIIIIIIIIIII----'; while ($str2 =~ /(-+)/g) { my ($start, $end) = ($-[0], $+[0]); my $matchLen = $end - $start; next if substr($str1, $start, $matchLen) =~ /^-+$/; my $chIdx = $end == length($str2) ? $start - 1 : $end; substr ($str2, $start, $matchLen, substr($str2, $chIdx, 1) x $matc +hLen); } print $str2;

    Prints:

    OOOOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMM +OOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMIIIIIII----M +MMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMMMMIIIIIIII +IIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOOOOOOO--OOOMMMMMMMMMMMMMMMMMMMMMI +IIIIIIIIIIIIIIII
    Perl is the programming world's equivalent of English
      Modifying $str2 is resetting pos($str2), so you're doing a lot of unneeded work. Adding pos($str2) = $end; at the end of the loop addresses this issue.

      Your script seems to make some additional assumptions that I cannot find in the original question. For example, for

      my $str1='--M--CCA'; my $str2='-----OOO';

      your script prints OOOOOOOO while I would have thought it should be --O--OOO?

      UPDATE: In order to avoid any confusion, the strings above are NOT part of the original question but examples I constructed assuming they could occur. The purpose of this was to highlight a situation where the proposed script would deliver something that violates the original requirements. Not sure the example is really relevant.

        Good catch. It's a bug. The line:

        next if substr($str1, $start, $matchLen) =~ /^-+$/;

        should be more like (untested):

        next if ! matchLen || substr($str1, $start, $matchLen) !~ /[^-]/;
        Perl is the programming world's equivalent of English
      Thanks so much!

      No, the split function is not a good way to do this trick. A regular expression is a better bet:

      FWIW, split function , same as match operator, both take a regular expression ;)

Re: More efficient way for this pattern match?
by Anonymous Monk on Apr 14, 2015 at 14:44 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1123341 use strict; use warnings; my $str1='MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTI +AAAMYFAMATGFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLA +GANRN----TIATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVG +TLSENARNRAPEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSA +KVGFGVILLQSRSVLERVATPTAAPT'; my $str2='--OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMM +MMMMMMMMMMOOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMII +IIIII----MMMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMM +MMIIIIIIIIIIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMM +MMMMMMMMMIIIIIIIIIIIII----'; my ($last) = $str2 =~ /.*(\w)/; # find letter for ending -'s tr/\0-/-=/, s/=(?=\W*(\w?))/ $1 || $last /ge for my $answer = $str1 =~ tr/A-Z/\0/r ^ $str2; print "$answer\n";

    Explanation provided on request :)

Re: More efficient way for this pattern match?
by Anonymous Monk on Apr 14, 2015 at 19:41 UTC

    See also this thread for a very similar, though not identical, problem. Also, Hello! to your perl teacher.

    #! /usr/bin/perl sub ano1 { my ($str1, $str2) = @_; my ($last) = $str2 =~ /.*(\w)/; tr/\0-/-=/, s/=(?=\W*(\w?))/ $1 || $last /ge for my $answer = $str1 =~ tr/A-Z/\0/r ^ $str2; $answer; } sub ano2 { my ($x, $y) = @_; $y &= ($y =~ tr/-\0-\377/\0\377/r) | ($x =~ tr/-\0-\377/\377\0/r); $y =~ s/(\0+)(.)/$2 x (1 + length $1)/eg; $y =~ s/(.)(\0+)$/$1 x (1 + length $2)/er; } chomp (our ($str1, $str2) = <DATA>); use Benchmark 'cmpthese'; cmpthese -5, { ano1 => q( ano1 $str1, $str2 ), ano2 => q( ano2 $str1, $str2 ), }; __DATA__ MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTIAAAMYFAMAT +GFGVT-------------EVMVG----DE---ALTIYWARYADWLFTTPLLLLDLSLLAGANRN----T +IATLIG-LDVFMIG---T---GAIAALSST-PGTRIAWWAIST--GALL--ALLYVLVGTLSENARNRA +PEVA--SLFGRLRNLVIALWFLYPVVWILGT---EGTFGILP--LYWETAAFMVLDLSAKVGFGVILLQ +SRSVLERVATPTAAPT --OOOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMMIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMM +OOOOO-------------OOOOO----OO---OOOOMMMMMMMMMMMMMMMMMMMMMIIIIIII----M +MMMMMM-MMMMMMM---M---MMMMMMOOO-OOOOMMMMMMMM--MMMM--MMMMMMMMMMIIIIIIII +IIII--IIIIMMMMMMMMMMMMMMMMMMMMO---OOO-OOOO--OOOMMMMMMMMMMMMMMMMMMMMMI +IIIIIIIIIIII----

      Sigh, this is too crazy. I give up, and I'll go back to golf!

      This runs faster as a one-liner than as multiple lines. :(

      Also tr/A-Z/\0/r is faster than tr/-/\0/cr It makes no sense :(

      sub ano1 { (pop ^ pop =~ tr/A-Z/\0/r) =~ tr/\0-/-=/r =~ s/(=+(\w))/$2 +x length $1/ger =~ s/((.)=+)$/$2 x length $1/er }