Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!
I have some thousands of strings like the following:
    >id1
    sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
    >id_2
    ssDDDDDDDDDDDDDDDDDDDDDDssssssssDDDDDDDDDDssssssssDDDDDssssssssssssssDDDDDDDDDDDDDDDDsDDDDDDDDDssssssssssssssUUUUUUUUUs
    ...

On these I would like to perform the following substitutions:
1. replace all s prior to U with i and all s following a U with o
2. replace all s prior to D with o and all s following a D with i
3. replace all D and U with M
4. if no D or U are present, then all s becomes

I wrote the following, but it does not always seem to work properly (it could probably be written more simply too):

    while(<>) {
        if($_=~/^>/) {
            $id=$_;
            $seq=<>;
            print $id;
            if($seq=~/^s+$/) {
                $seq=~s/s/I/g;
            }
            else {
                while($seq=~/(s+)(U+)(s+)/g) {
                    $part_before_U=$1;
                    $len1_U=length($part_before_U);
                    $part_U=$2;
                    $part_after_U=$3;
                    $len2_U=length($part_after_U);
                    $in_part_U='I' x $len1_U;
                    $out_part_U = 'O' x $len2_U;
                    $seq=~s/$part_before_U/$in_part_U/;
                    $seq=~s/$part_after_U/$out_part_U/;
                }
                while($seq=~/(s+)(D+)(s+)/g) {
                    $part_before_D=$1;
                    $len1_D=length($part_before_D);
                    $part_D=$2;
                    $part_after_D=$3;
                    $len2_D=length($part_after_D);
                    $out_part_D = 'O' x $len1_D;
                    $in_part_D='I' x $len2_D;
                    $seq=~s/$part_before_D/$out_part_D/;
                    $seq=~s/$part_after_D/$in_part_D/;
                }
                $seq=~s/U/M/g;
                $seq=~s/D/M/g;
            }
            print $seq;
        }
    }

Re: Pattern matching simultaneous substitution
by kcott (Archbishop) on Jan 05, 2022 at 20:24 UTC

    Your stated requirements have problems:

    1. In isolation, this is pretty straightforward. The following would need some additional "checking" code, but the guts of it are:
      $ perl -E '
          my $str = "sssDDDsssDDDssUss";
          my ($fore, $aft) = split /U/, $str, 2;
          $fore =~ s/s/i/g;
          $aft =~ s/s/o/g;
          say "$str\n", join "U", $fore, $aft;
      '
      sssDDDsssDDDssUss
      iiiDDDiiiDDDiiUoo
    2. This makes no sense as there are no more 's' characters left to modify. If there's some typo in what you wrote, then perhaps something similar to the code in my last point would suffice.
    3. This is easily achieved with transliteration:
      $ perl -E '
          my $str = "sssDDDsssDDDssUss";
          say $str;
          $str =~ y/DU/M/;
          say $str;
      '
      sssDDDsssDDDssUss
      sssMMMsssMMMssMss
    4. You didn't finish writing this point: "... then all s becomes". You'll need to tell us what you intended to write after "becomes".
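
For what it's worth, the "checking" code mentioned in point 1 might look something like the sketch below. The sub name `relabel_around_u` is made up for illustration, and leaving the s characters untouched when there is no U is an assumption, not something stated in the question. It only covers points 1 and 3; it says nothing about D.

```perl
use strict;
use warnings;

# Hypothetical helper: apply rule 1 only (s before the first U -> i,
# s after it -> o), then rule 3 (D and U -> M).
sub relabel_around_u {
    my ($str) = @_;

    # The check: without a U there is nothing to split on, so leave the
    # s characters alone rather than guess (assumption).
    if ( index( $str, 'U' ) >= 0 ) {
        my ( $fore, $aft ) = split /U/, $str, 2;
        $fore =~ tr/s/i/;
        $aft  =~ tr/s/o/;
        $str = join 'U', $fore, $aft;
    }

    $str =~ tr/DU/M/;
    return $str;
}

print relabel_around_u('sssDDDsssDDDssUss'), "\n";   # iiiMMMiiiMMMiiMoo
print relabel_around_u('ssDDss'), "\n";              # ssMMss
```

The limit of 2 passed to split keeps everything after the first U in one piece, so a run such as UUU is handled by the final tr/DU/M/ rather than by the split.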

    Some other points:

    • Your input data is far too long. The same information could have been conveyed with strings of less than a dozen characters.
    • You don't tell us what output you expected.
    • You don't tell us what output you got from your posted code.
    • Please do not italicise your code. It's harder to read; in particular, backslashes (\) can look like pipes (|).
    • Your input looks like FASTA format. If you need it, there are plenty of examples of parsing that format on this site.
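
On the FASTA point, a minimal reader could be sketched as follows. It assumes header lines start with ">" and that a sequence may be wrapped over several lines; the sub name `read_fasta` and the sample data are invented for the example. For real work a dedicated module is likely a better choice.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA-style records from a filehandle and return
# a list of [ header, sequence ] pairs. Sequence lines belonging to one
# header are joined, so wrapped sequences are handled too.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line eq '';
        if ( $line =~ /^>(.*)/ ) {
            push @records, [ $1, '' ];      # start a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;       # append to the current sequence
        }
    }
    return @records;
}

# Demo on an in-memory filehandle with made-up data.
my $input = ">id1\nssss\nss\n>id_2\nssDDss\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
for my $rec ( read_fasta($fh) ) {
    print "$rec->[0]: $rec->[1]\n";
}
close $fh;
```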

    — Ken

      Thanks Ken. Re point #2, I put it there, because there can be cases without Ds, but only Us. Therefore there needs to be a conditional on that scenario as well.
Re: Pattern matching simultaneous substitution
by choroba (Cardinal) on Jan 05, 2022 at 20:15 UTC
    > 4. if no D or U are present, then all s becomes

    Becomes what?

    If the script works correctly for the sample input, please add a sequence that produces a wrong output.

    Moreover, also include the expected output for the part that the script doesn't process correctly.

    General comment: Use strict and warnings. They prevent some beginner mistakes.
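
As a small illustration of the kind of mistake strict catches, here is a sketch with a deliberately misspelled variable. The snippet is compiled through a string eval only so that the error can be printed without stopping the script; the variable names are invented for the example.

```perl
use strict;
use warnings;

# With strict enabled, a misspelled variable name is a compile-time error
# instead of a silently empty value.
my $code = <<'END';
use strict;
my $seq = 'ssDDss';
$sq =~ tr/s/i/;    # typo: $sq instead of $seq
END

my $ok  = eval "$code; 1";
my $err = $ok ? '' : $@;
print $ok ? "compiled\n" : "rejected: $err";
```

Without strict the same typo would quietly operate on an empty package variable and $seq would never change.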

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Sorry, I missed that :)
      - if no D or U are present, then all s becomes i
      The following one was not properly converted:
      old:ssssssDDDDDDDDDDDDDDsssssssssssssDDDDDssssssssssssssssssssssssssssssssssssssssssDDDDDDDssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssDDDDDDsssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
      new:OOOOOOMMMMMMMMMMMMMMIIIIIIIIIIIIIMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMMMMMMsssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
        I still don't understand the rules. If the input is
        DDssDD
        what should it become? Are the "s" prior to "D" or following a "D"?

        In the new/old example, why are the final "s" not replaced? Don't they follow a "D"?

        Please, try to be more precise.

        Also, you can easily shorten the data, 2 consecutive characters of each type would do.
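
For reference, the unreplaced trailing "s" can be reproduced with much shorter data. The sketch below is a reduced copy of the D loop from the original post, wrapped in a sub with a made-up name (`posted_d_loop`) and with `my` added so it runs under strict. It does not try to decide what the correct output should be.

```perl
use strict;
use warnings;

# Reduced copy of the D-handling loop from the original post, wrapped in a
# sub (hypothetical name) so it can be called on short test strings.
sub posted_d_loop {
    my ($seq) = @_;
    while ( $seq =~ /(s+)(D+)(s+)/g ) {
        my ( $before, $after ) = ( $1, $3 );
        my $out = 'O' x length $before;
        my $in  = 'I' x length $after;
        $seq =~ s/$before/$out/;    # replaces the first matching run of s anywhere
        $seq =~ s/$after/$in/;
    }
    $seq =~ tr/DU/M/;
    return $seq;
}

print posted_d_loop('ssDDssDDss'), "\n";    # OOMMIIMMss
```

The pattern needs a run of s on both sides of a run of D. Once the middle "ss" has been turned into "II", the second "DD" no longer has any s in front of it, so the pattern cannot match again and the final "ss" is never touched.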

Re: Pattern matching simultaneous substitution
by tybalt89 (Monsignor) on Jan 05, 2022 at 23:36 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11140181
    use warnings;

    my %h = qw( bU i aU o bD o aD i );
    while( <DATA> )
      {
      print "\n$_" . <DATA>;
      /U|D/ or tr/s/i/;
      s/(s*)(U|D)(s*)/$h{"b$2"} x length($1) . 'M' . $h{"a$2"}x length($3) /ge;
      print;
      }

    __DATA__
    sssss
    iiiii
    sssUUss
    iiiMMoo
    sssDDss
    oooMMii
    sssssDDDDDDDssssUUUss
    oooooMMMMMMMiiiiMMMoo
    sssssUUUUUUUssssDDDss
    iiiiiMMMMMMMooooMMMii
    sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
    ???
    ssDDDDDDDDDDDDDDDDDDDDDDssssssssDDDDDDDDDDssssssssDDDDDssssssssssssssDDDDDDDDDDDDDDDDsDDDDDDDDDssssssssssssssUUUUUUUUUs
    ???

    Outputs (input/expected/actual):

    sssss
    iiiii
    iiiii

    sssUUss
    iiiMMoo
    iiiMMoo

    sssDDss
    oooMMii
    oooMMii

    sssssDDDDDDDssssUUUss
    oooooMMMMMMMiiiiMMMoo
    oooooMMMMMMMiiiiMMMoo

    sssssUUUUUUUssssDDDss
    iiiiiMMMMMMMooooMMMii
    iiiiiMMMMMMMooooMMMii

    sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
    ???
    iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

    ssDDDDDDDDDDDDDDDDDDDDDDssssssssDDDDDDDDDDssssssssDDDDDssssssssssssssDDDDDDDDDDDDDDDDsDDDDDDDDDssssssssssssssUUUUUUUUUs
    ???
    ooMMMMMMMMMMMMMMMMMMMMMMiiiiiiiiMMMMMMMMMMiiiiiiiiMMMMMiiiiiiiiiiiiiiMMMMMMMMMMMMMMMMiMMMMMMMMMiiiiiiiiiiiiiiMMMMMMMMMo

    ??? marks the cases where no expected output was provided.
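
To try the same substitution on other strings, it can be wrapped in a sub. This is only a sketch: the name `relabel` is made up, and the logic is the tr/// and s///ge pair from the script above. It also shows what this approach does with the DDssDD case asked about earlier in the thread, without claiming that is the intended rule.

```perl
use strict;
use warnings;

# Lookup table from the script above: b = before, a = after.
my %h = qw( bU i aU o bD o aD i );

# Hypothetical wrapper around the same tr/// and s///ge steps.
sub relabel {
    my ($seq) = @_;
    $seq =~ /U|D/ or $seq =~ tr/s/i/;
    $seq =~ s/(s*)(U|D)(s*)/$h{"b$2"} x length($1) . 'M' . $h{"a$2"} x length($3)/ge;
    return $seq;
}

print relabel($_), "\n" for qw( sssss sssDDss DDssDD );
```

Because the trailing (s*) is greedy, a run of s between two D or U runs is always labeled by the run on its left. For DDssDD this gives MMiiMM, that is, the s are treated as "following a D".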