in reply to string manipulation with Regex

I think this will do what you want, but I haven't been able to fully test it, for lack of a few examples, so I made up the input string:
use strict;
use warnings;
use 5.012;

my $command = '8M5I4D5M';
my $input   = 'M234567MI234IM234M';

while ($command =~ m/(\d+)([MID])/g) {
    my $value = $1;
    my $code  = $2;
    given ($code) {
        when ('D') { print 'X' x $value; }
        when ('I') { $input = substr($input, $value); }
        when ('M') {
            print substr($input, 0, $value);
            $input = substr($input, $value);
        }
    }
}
output:
M234567MXXXXM234M
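
For readers on newer Perls: given/when has been flagged as experimental since Perl 5.18, so here is a rough sketch of the same loop using a dispatch table instead. The helper name apply_cigar and the %handler hash are my own inventions, not part of the code above, and the result is returned rather than printed. It is only meant to illustrate the idea under those assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): walk the CIGAR string and
# build the output, consuming the input from the front as we go.
sub apply_cigar {
    my ($input, $cigar) = @_;
    my $output = '';

    # One handler per CIGAR operation; each receives the run length.
    my %handler = (
        M => sub { $output .= substr($input, 0, $_[0], '') },   # copy, then consume
        I => sub { substr($input, 0, $_[0], '') },              # consume only
        D => sub { $output .= 'X' x $_[0] },                    # pad with placeholders
    );

    while ($cigar =~ m/(\d+)([MID])/g) {
        my ($length, $op) = ($1, $2);
        $handler{$op}->($length);
    }
    return $output;
}

print apply_cigar('M234567MI234IM234M', '8M5I4D5M'), "\n";
# prints M234567MXXXXM234M
```

The four-argument form of substr() removes the letters it returns, so the M case copies and consumes in a single step.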

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^2: string manipulation with Regex
by FluffyBunny (Acolyte) on Sep 01, 2010 at 20:19 UTC

    Dear CountZero and other senior Perlmonks who replied to my post: This is basically what I am trying to do:

    You have a text file with a string like this:

    CGAATTAATGGGAATTG

    and you have a reference sequence like this:

    CGAATTAAGGAATTG

    Note that the input has two letters inserted, namely TG.

    So your alignment program will give you a CIGAR string saying

    8M2I7M

    To show this visually, I aligned the two strings by hand (the gap is not in the original data).

    CGAATTAATGGGAATTG

    CGAATTAA  GGAATTG

    Using this, I would like the input's letters to sit at the same positions as in the reference. That is why I have to compare the input string with the reference string (basically, insertions are useless).

    For deletions I have the exact opposite situation, so let's flip my first example. In this case the CIGAR will be 8M2D7M.

    CGAATTAA  GGAATTG <--- my input for 2nd example (the gap is again intentional)

    CGAATTAATGGGAATTG <--- my reference for 2nd example

    You see I am missing two letters, and I need placeholders to keep the input's letter positions the same as in the reference, so I want to fill in two X's.

    CGAATTAAXXGGAATTG <--- same length, other letter positions will be the same
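
    The length bookkeeping in the two examples above can be checked mechanically. This is only a sketch, assuming the CIGAR contains nothing but M, I and D runs as in this thread: M and I runs count letters in the input, while M and D runs count letters in the reference. The helper name cigar_lengths is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): total the run lengths per
# operation and derive how many letters the CIGAR expects on each side.
sub cigar_lengths {
    my ($cigar) = @_;
    my %total = (M => 0, I => 0, D => 0);
    while ($cigar =~ m/(\d+)([MID])/g) {
        $total{$2} += $1;
    }
    return (
        $total{M} + $total{I},    # letters present in the input
        $total{M} + $total{D},    # letters covered on the reference
    );
}

for my $cigar ('8M2I7M', '8M2D7M') {
    my ($input_len, $ref_len) = cigar_lengths($cigar);
    print "$cigar: input=$input_len reference=$ref_len\n";
}
# prints
# 8M2I7M: input=17 reference=15
# 8M2D7M: input=15 reference=17
```

    The second number is the length the rewritten input should end up with, which is why the deletion example grows from 15 to 17 letters once the two X's are filled in.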

    I also posted the link to my original post, which was about a different problem (already fixed); you can find my input files there. Thank you for your help.

      Update: I missed the part about multiple I and D sections, so I adjusted the loop to handle that. I now also see that there was some CountZero code above the thread level I replied to. His code looks fine to me. What I did is very similar, except that I used substr() instead of print.

      I think this does what you want. Basically in the CIGAR, an insertion becomes a deletion and vice-versa. So I use the edit instructions in the CIGAR in an inverse sense.

      The total field lengths in the CIGAR (viewed in inverted sense) may be less than the number of characters in the input, so I think this means truncate the output to whatever that total is.

      Whether some final adjustment is needed after applying the inverse of all the editing commands, either truncating or perhaps adding more X's, is unclear to me. It is just a matter of knowing what is required, which is why I kept a running tally of the total length.

      #!/usr/bin/perl -w
      use strict;

      while (<DATA>) {
          next if /^\s*$/;    #skip blank lines
          my ($input, $CIGAR) = split;
          my $ref = $input;   #working copy of $input
          my (@edit_cmd) = $CIGAR =~ m/\d+\w/g;
          my $curr_pos  = 0;
          my $total_len = 0;
          foreach my $cmd (@edit_cmd) {
              if (my ($M) = $cmd =~ m/(\d+)M/) {
                  $curr_pos  += $M;
                  $total_len += $M;
              }
              elsif (my ($I) = $CIGAR =~ m/(\d+)I/) {
                  substr($ref, $curr_pos, $I, '');    #delete $I characters
                  $total_len -= $I
              }
              elsif (my ($D) = $CIGAR =~ m/(\d+)D/) {
                  substr($ref, $curr_pos, 0, "X" x $D);    #insert $D X's
                  $total_len += $D;
                  $curr_pos  += $D;
              }
          }
          $ref = substr($ref, 0, $total_len);    #truncate ?????
          print "INPUT = $input CIGAR = $CIGAR\n";
          print "REF = $ref\n\n";
      }

      =prints
      INPUT = CGAATTAATGGGAATTG CIGAR = 8M2I7M
      REF = CGAATTAAGGAAT

      INPUT = CGAATTAATGGGAATTG CIGAR = 2M2I2M3D10M
      REF = CGTTTGGGAA

      INPUT = CGAATTAATGGGA CIGAR = 8M2D7M
      REF = CGAATTAAXXTGGGA
      =cut

      __DATA__
      CGAATTAATGGGAATTG 8M2I7M
      CGAATTAATGGGAATTG 2M2I2M3D10M
      CGAATTAATGGGA 8M2D7M

        Thank you very much for your assistance! It makes sense to me. Now I have to work on multiple D and I combo. You are all awesome Perlmonks! =D

        That almost worked for my own tests. I switched $CIGAR to $cmd in the elsif statements, and then it worked no matter what combination of I's, D's, or M's I used.

        Thanks for your help! =)
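
        For anyone finding this later, the $CIGAR to $cmd change could look roughly like the sketch below. It is not the code from the post above: the sub name project_onto_reference is made up, the running tally is replaced by the current position, and cutting the result at the total of the M and D runs is my assumption about what is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): rewrite a read into
# reference coordinates, editing a working copy with 4-argument substr().
sub project_onto_reference {
    my ($read, $cigar) = @_;
    my $out = $read;    # working copy
    my $pos = 0;        # current position in reference coordinates

    foreach my $cmd ($cigar =~ m/\d+[MID]/g) {
        last if $pos > length $out;    # CIGAR runs past the read; stop editing
        if (my ($m) = $cmd =~ m/^(\d+)M$/) {
            $pos += $m;                         # matched letters stay in place
        }
        elsif (my ($i) = $cmd =~ m/^(\d+)I$/) {
            substr($out, $pos, $i, '');         # drop the inserted letters
        }
        elsif (my ($d) = $cmd =~ m/^(\d+)D$/) {
            substr($out, $pos, 0, 'X' x $d);    # fill the deletion with X's
            $pos += $d;
        }
    }
    return substr($out, 0, $pos);    # assumption: cut at the M + D total
}

print project_onto_reference('CGAATTAATGGGAATTG', '8M2I7M'), "\n";
print project_onto_reference('CGAATTAAGGAATTG',   '8M2D7M'), "\n";
# prints
# CGAATTAAGGAATTG
# CGAATTAAXXGGAATTG
```

        Each test looks at the current command only, so an I elsewhere in the CIGAR can no longer hijack a D step, which is what happened with the 2M2I2M3D10M line.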

      Did you actually try my program with the sample inputs you mention above?

      If so, you will have seen that the output is exactly as you expect! So what is still your problem?

      CountZero


        I just wanted to explain more clearly. The code itself is working. Thank you for your help =)