in reply to Re^2: string manipulation with Regex
in thread string manipulation with Regex
I think this does what you want. Basically in the CIGAR, an insertion becomes a deletion and vice-versa. So I use the edit instructions in the CIGAR in an inverse sense.
The total of the field lengths in the CIGAR (viewed in the inverted sense) may be less than the number of characters in the input, so I think this means the output should be truncated to whatever that total is.
Whether some final adjustment is needed after applying the inverse of all the edit commands (truncating, or perhaps adding more "X"s) is unclear to me. It is just a matter of knowing what is required, which is why I kept a running tally of the total length.
#!/usr/bin/perl -w
use strict;

while (<DATA>)
{
    next if /^\s*$/;                 #skip blank lines
    my ($input, $CIGAR) = split;
    my $ref = $input;                #working copy of $input
    my (@edit_cmd) = $CIGAR =~ m/\d+\w/g;   #e.g. ("8M", "2I", "7M")
    my $curr_pos  = 0;
    my $total_len = 0;
    foreach my $cmd (@edit_cmd)
    {
        #each test must look at the current command, not the whole CIGAR
        if (my ($M) = $cmd =~ m/(\d+)M/)
        {
            $curr_pos  += $M;
            $total_len += $M;
        }
        elsif (my ($I) = $cmd =~ m/(\d+)I/)
        {
            substr($ref, $curr_pos, $I, '');        #delete $I characters
            $total_len -= $I;
        }
        elsif (my ($D) = $cmd =~ m/(\d+)D/)
        {
            substr($ref, $curr_pos, 0, "X" x $D);   #insert $D X's
            $total_len += $D;
            $curr_pos  += $D;
        }
    }
    $ref = substr($ref, 0, $total_len);   #truncate ?????
    print "INPUT = $input CIGAR = $CIGAR\n";
    print "REF = $ref\n\n";
}

=prints

INPUT = CGAATTAATGGGAATTG CIGAR = 8M2I7M
REF = CGAATTAAGGAAT

INPUT = CGAATTAATGGGAATTG CIGAR = 2M2I2M3D10M
REF = CGTTXXXAATGGGAA

INPUT = CGAATTAATGGGA CIGAR = 8M2D7M
REF = CGAATTAAXXTGGGA

=cut

__DATA__
CGAATTAATGGGAATTG 8M2I7M
CGAATTAATGGGAATTG 2M2I2M3D10M
CGAATTAATGGGA 8M2D7M
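One way to sanity-check the running tally, assuming the usual SAM convention that M and I consume characters of the read while M and D consume characters of the reference, is to sum the field lengths per operation. This is only a sketch: the helper name cigar_lengths is made up for illustration, and it assumes that nothing but M, I and D operations appear.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): sum the CIGAR
# field lengths per operation. Assumes only M, I and D operations occur.
sub cigar_lengths {
    my ($cigar) = @_;
    my %sum = (M => 0, I => 0, D => 0);
    while ($cigar =~ /(\d+)([MID])/g) {
        $sum{$2} += $1;
    }
    my $read_len = $sum{M} + $sum{I};   # characters the CIGAR expects in the input
    my $ref_len  = $sum{M} + $sum{D};   # characters spanned on the reference
    return ($read_len, $ref_len);
}

for my $cigar (qw(8M2I7M 2M2I2M3D10M 8M2D7M)) {
    my ($read_len, $ref_len) = cigar_lengths($cigar);
    print "$cigar: read length = $read_len, reference length = $ref_len\n";
}
```

Under that assumption, only the first sample line has an input whose length (17) matches what its CIGAR implies. The second input has 17 characters against an implied 16, and the third has 13 against an implied 15, which may be why the final truncate-or-pad step is ambiguous.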