Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks!
I have a number of sequences that contain some characters (specifically I, M, O, P, B) and sometimes the character U, and I want to get rid of the U's.
There can be these 3 cases:

1. U is in the beginning of the string, like the following example:
$seq=UUUUUUUUIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO

Here, it would be replaced by I, and I do this by writing:
    if($seq=~/^(U+)([I|O|P|B|M])/)
    {
        $part_to_change1=$1;
        $len1=length($part_to_change1);
        $char1=$2;
        substr($top, 0, $len1, ($char1 x $len1));
    }

2. U is in the end, like the following example:
$seq=IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOUUUUUUUU

Here, U would be changed to O, and for that I use the following commands:
    if($seq=~/.*?([I|O])(U+)$/)
    {
        $char2=$1;
        $part_to_change2=$2;
        $len2=length($part_to_change2);
        substr($top, -$len2, $len2, ($char2 x $len2));
    }

So now, what I am missing is the way to replace U when I find it in the middle of the sequence, like the following examples:
* $seq=IIIIIIIIIIIIIIIMMMMMMMMMMMUUUUUUUUMMMMMMMMMMMMMOOOO
* $seq=IIIIIIIUUUUUIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUUUUUUOO
* $seq=IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUOO

In all the above cases, the U needs to be changed to the character that 'encapsulates' it, i.e. U -> M for the first example, U -> I and U -> O for the second example, and U -> O for the third. Can you give me some help?

Replies are listed 'Best First'.
Re: Replace characters within string
by tybalt89 (Monsignor) on Sep 07, 2022 at 22:17 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11146755
    use warnings;

    while( <DATA> )
      {
      chomp;
      print "\n$_\n";
      s/^(U+)/ 'I' x length $1 /e;
      s/(U+)$/ 'O' x length $1 /e;
      s/(?<=([^U]))(U+)(?=\1)/ $1 x length $2 /ge;
      print "$_\n";
      }

    __DATA__
    UUUUUUUUIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO
    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOUUUUUUUU
    IIIIIIIIIIIIIIIMMMMMMMMMMMUUUUUUUUMMMMMMMMMMMMMOOOO
    IIIIIIIUUUUUIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUUUUUUOO
    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUOO
    IUIUIUIUIUIUIUIUI

    Outputs:

    UUUUUUUUIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO
    IIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO

    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOUUUUUUUU
    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOOOOOOOOO

    IIIIIIIIIIIIIIIMMMMMMMMMMMUUUUUUUUMMMMMMMMMMMMMOOOO
    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMOOOO

    IIIIIIIUUUUUIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUUUUUUOO
    IIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOOOOOOO

    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUOO
    IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOO

    IUIUIUIUIUIUIUIUI
    IIIIIIIIIIIIIIIII

Re: Replace characters within string
by hv (Prior) on Sep 07, 2022 at 22:08 UTC

    For the third part, if I understand the requirement correctly, you could achieve it with something like this:

    $seq =~ s{
        ([^U])    (?# anything other than U can be the "bracketing" character)
        (U+)      (?# match one or more Us)
        (?=\1)    (?# followed by the bracketing character again)
    }{
        # replace the bracketing character and the Us that follow it
        # with an equal number of copies of the bracketing character
        $1 x (1 + length($2))
    }xeg;  # replace all embedded sequences of Us in one go

    Note that this uses a lookahead to match the second copy of the bracketing character, to ensure that all matches can be replaced with a single invocation - I assume we want to allow "IUIUI" to be translated to "IIIII".
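To see why the lookahead matters, here is a minimal sketch comparing a pattern that consumes the closing bracketing character with one that only looks ahead at it. The sub names `fill_consuming` and `fill_lookahead` are made up for this illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the closing bracketing character is consumed by
# the match, so it cannot also open the next match.
sub fill_consuming {
    my ($seq) = @_;
    $seq =~ s{([^U])(U+)\1}{ $1 x (2 + length $2) }eg;
    return $seq;
}

# Hypothetical helper: the closing character is only looked at, so it
# stays available as the opening character of the next match.
sub fill_lookahead {
    my ($seq) = @_;
    $seq =~ s{([^U])(U+)(?=\1)}{ $1 x (1 + length $2) }eg;
    return $seq;
}

print fill_consuming('IUIUI'), "\n";   # IIIUI
print fill_lookahead('IUIUI'), "\n";   # IIIII
```

With the consuming version, the second U is left behind after a single pass, so the substitution would have to be repeated until nothing changes.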

    The same approach can also be used to simplify the head and tail matches:

    $seq =~ s{^(U+)}{"I" x length($1)}eg;
    $seq =~ s{(U+)$}{"O" x length($1)}eg;

    Note that I have ignored your assertion in the first case that the sequence of initial Us must be followed by one of I, O, P, B, M since that appears to be guaranteed; I've also ignored the assertion in the second case that the sequence of final Us must be preceded by I or O, since you don't anywhere say that that is required. (If it _is_ required, I'd use a lookbehind for that assertion.)
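If the trailing run of Us really must be preceded by I or O, the lookbehind mentioned above might look roughly like this. It is a sketch under that assumption, and `fill_tail` is a hypothetical name used only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a trailing run of Us with Os, but only
# when the run is directly preceded by I or O (an assumed requirement).
sub fill_tail {
    my ($seq) = @_;
    $seq =~ s{(?<=[IO])(U+)$}{ 'O' x length $1 }e;
    return $seq;
}

print fill_tail('IIMMOOUUU'), "\n";   # IIMMOOOOO
print fill_tail('IIMMUUU'),   "\n";   # IIMMUUU (unchanged: preceded by M)
```

Because the lookbehind does not consume the preceding character, the replacement only has to produce the Os that stand in for the Us.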

    Hope this helps

Re: Replace characters within string
by kcott (Archbishop) on Sep 08, 2022 at 04:46 UTC

    Do you have Perl 5.14 (or later) for "Non-destructive substitution"? If so:

    $ perl -E '
        use v5.14;
        my $seq = "UUUIUUIMUUMOUUOPUUPBUUBUUU";
        my $exp = "IIIIIIIMMMMOOOOPPPPBBBBBBB";
        say $seq;
        say $seq =~ s/^(U+)(.)/$2 x length($1) . $2/er =~ s/(.)(U+)/$1 . $1 x length($2)/egr;
        say $exp;
    '
    UUUIUUIMUUMOUUOPUUPBUUBUUU
    IIIIIIIMMMMOOOOPPPPBBBBBBB
    IIIIIIIMMMMOOOOPPPPBBBBBBB
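For readers who have not met the `/r` modifier, the following small sketch shows what "non-destructive" means here: the substitution returns a modified copy and leaves the original variable alone. The sample string and the variable name `$copy` are chosen only for this illustration.

```perl
use v5.14;   # the /r modifier is available from Perl 5.14
use warnings;

my $seq = 'IIUUIIMM';

# With /r, s/// returns the changed copy instead of modifying $seq.
my $copy = $seq =~ s/(.)(U+)/$1 . $1 x length($2)/egr;

say $seq;    # IIUUIIMM (original left untouched)
say $copy;   # IIIIIIMM
```

Because each `/r` substitution returns a string, several of them can be chained with `=~`, as the one-liner above does.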

    — Ken

Re: Replace characters within string
by AnomalousMonk (Archbishop) on Sep 08, 2022 at 01:30 UTC

    The very first thing to do in a case like this is to write a test plan, then write a whole bunch of tests. I leave all the rest of the tests to you.

    Win8 Strawberry 5.8.9.5 (32) Wed 09/07/2022 21:20:21
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Test::More;
    use Test::NoWarnings;

    # use Data::Dump qw(dd);  # for debug

    my @Tests = (
        'ALL these strings have replacements',
        [ 'UUUUUUUUIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO',
          'IIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO', ],
        [ 'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOUUUUUUUU',
          'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOOOOOOOOO', ],
        [ 'IIIIIIIIIIIIIIIMMMMMMMMMMMUUUUUUUUMMMMMMMMMMMMMOOOO',
          'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMOOOO', ],
        [ 'IIIIIIIUUUUUIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUUUUUUOO',
          'IIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOOOOOOO', ],
        [ 'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUOO',
          'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOO', ],

        'NONE of these strings have replacements',
        [ ('IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMMMUUOO') x 2 ],
        );  # end @Tests

    my @additional = qw(Test::NoWarnings);  # each of these adds 1 test

    plan 'tests' => (scalar grep { ref eq 'ARRAY' } @Tests)
                  + @additional;

    VECTOR:
    for my $ar_vector (@Tests) {
        if (not ref $ar_vector) {
            note $ar_vector;
            next VECTOR;
            }

        my ($input, $expected) = @$ar_vector;

        my $got = replace($input);
        is $got, $expected, "'$input' \n '$got'" ;
        }  # end for VECTOR

    sub replace {  # works
        my ($str, ) = @_;

        # only ONE or NONE of these replacements will be made.
        $str =~ s{ \A (U+) (?= ([IOPBM])) }         { $2 x length $1 }xmse   or
        $str =~ s{ (?<= ([IO])) (U+) \z }           { $1 x length $2 }xmse   or
        $str =~ s{ (?<= ([IOPBM])) (U+) (?= \1) }   { $1 x length $2 }xmseg
        ;

        return $str  # return string with possible replacements
        }  # end sub replace()
    ^Z
    1..7
    # ALL these strings have replacements
    ok 1 - 'UUUUUUUUIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO'
    #  'IIIIIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOO'
    ok 2 - 'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOUUUUUUUU'
    #  'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOOOOOOOOO'
    ok 3 - 'IIIIIIIIIIIIIIIMMMMMMMMMMMUUUUUUUUMMMMMMMMMMMMMOOOO'
    #  'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMOOOO'
    ok 4 - 'IIIIIIIUUUUUIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUUUUUUOO'
    #  'IIIIIIIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOOOOOOO'
    ok 5 - 'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOUUOO'
    #  'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMOOOOOO'
    # NONE of these strings have replacements
    ok 6 - 'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMMMUUOO'
    #  'IIIIIIIIIIIIIIIMMMMMMMMMMMMMMMMMMMMMMMMMMUUOO'
    ok 7 - no warnings

    See also How to ask better questions using Test::More and sample data.


    Give a man a fish:  <%-{-{-{-<

Re: Replace characters within string
by AnomalousMonk (Archbishop) on Sep 08, 2022 at 02:41 UTC
Re: Replace characters within string
by Anonymous Monk on Sep 07, 2022 at 23:42 UTC

    Stuff like [I|O|P|B|M] doesn't do what you think it does. In character classes (the square brackets) the pipe character is a literal pipe character, not an operator. So you should write [IOPBM].

    See "Special Characters Inside a Bracketed Character Class" in perlrecharclass for details.
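A quick way to check this for yourself is to test a literal pipe against both forms of the class. This is just an illustrative sketch; the sub names `in_class_with_pipes` and `in_class_without_pipes` are invented for it.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 if the single character is in the class.
sub in_class_with_pipes    { return $_[0] =~ /^[I|O|P|B|M]$/ ? 1 : 0 }
sub in_class_without_pipes { return $_[0] =~ /^[IOPBM]$/     ? 1 : 0 }

for my $char ('I', '|', 'X') {
    printf "%-2s [I|O|P|B|M]: %d   [IOPBM]: %d\n",
        $char, in_class_with_pipes($char), in_class_without_pipes($char);
}
```

The letter I is accepted by both classes and X by neither, but the pipe is accepted only by the class that was written with pipes in it.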

    Yet another proof that Guido was right.

    Are these characters amino acids or something? It's a good thing that Perl is still used by geneticists, otherwise we would've already been exterminated by engineered diseases.