in reply to Re^2: string manipulation with Regex
in thread string manipulation with Regex

Update: missed the part about multiple I,D sections....so adjusted loop to do that. And now I see that there was some Count Zero code prior to the thread level I've replied to. His code looks fine to me. What I did is very similar except that I used substr() instead of print.

I think this does what you want. Basically in the CIGAR, an insertion becomes a deletion and vice-versa. So I use the edit instructions in the CIGAR in an inverse sense.

The total field lengths in the CIGAR (viewed in inverted sense) may be less than the number of characters in the input, so I think this means truncate the output to whatever that total is.

whether or not some final adjustment to either truncate or perhaps add more "X"'s after inverse of all editing commands is unclear to me - just a matter of knowing what is required - that's why I kept a running tally of the total length.

#!/usr/bin/perl -w use strict; while (<DATA>) { next if /^\s*$/; #skip blank lines my ($input, $CIGAR) = split; my $ref = $input; #working copy of $input my (@edit_cmd) = $CIGAR =~ m/\d+\w/g; my $curr_pos = 0; my $total_len =0; foreach my $cmd (@edit_cmd) { if (my ($M) = $cmd =~ m/(\d+)M/) { $curr_pos += $M; $total_len+= $M; } elsif (my ($I) = $CIGAR =~ m/(\d+)I/) { substr($ref,$curr_pos,$I,''); #delete $I characters $total_len -= $I } elsif (my ($D) = $CIGAR =~ m/(\d+)D/) { substr($ref,$curr_pos,0,"X" x $D); #insert $D X's $total_len += $D; $curr_pos += $D; } } $ref = substr($ref,0,$total_len); #truncate ????? print "INPUT = $input CIGAR = $CIGAR\n"; print "REF = $ref\n\n"; } =prints INPUT = CGAATTAATGGGAATTG CIGAR = 8M2I7M REF = CGAATTAAGGAAT INPUT = CGAATTAATGGGAATTG CIGAR = 2M2I2M3D10M REF = CGTTTGGGAA INPUT = CGAATTAATGGGA CIGAR = 8M2D7M REF = CGAATTAAXXTGGGA =cut __DATA__ CGAATTAATGGGAATTG 8M2I7M CGAATTAATGGGAATTG 2M2I2M3D10M CGAATTAATGGGA 8M2D7M

Replies are listed 'Best First'.
Re^4: string manipulation with Regex
by FluffyBunny (Acolyte) on Sep 02, 2010 at 15:24 UTC
    Thank you very much for your assistance! It makes sense to me. Now I have to work on multiple D and I combo. You are all awesome Perlmonks! =D
      take a look multiple D and multiple I are already implemented.
Re^4: string manipulation with Regex
by bluecompassrose (Initiate) on Sep 02, 2010 at 16:06 UTC

    That almost worked for my own tests. I switched $CIGAR to $cmd in the elsif statements and it started doing it no matter what combination of I's, D's, or M's.

    Thanks for your help! =)