Re: string manipulation with Regex

Re^2: string manipulation with Regex by FluffyBunny (Acolyte) on Sep 01, 2010 at 20:19 UTC
Dear CountZero and other senior Perlmonks who replied to my post:

This is basically what I am trying to do. You have a text file with a string like this:

    CGAATTAATGGGAATTG

and a reference sequence:

    CGAATTAAGGAATTG

Note that the input has two letters inserted, TG, so the alignment program reports the CIGAR ID 8M2I7M. To help you see it visually, I manually aligned the two strings (the original has no blank):

    CGAATTAATGGGAATTG
    CGAATTAA  GGAATTG

Using this, I would like to keep the original letter positions for the input. That is why I have to compare the input string to the reference string, so that both have the same letter positions (basically, insertions are useless).

For deletions I have the exact opposite situation, so let's flip my first example. In this case the CIGAR will be 8M2D7M:

    CGAATTAA  GGAATTG   <--- my input for 2nd example (the gap is again intentionally made)
    CGAATTAATGGGAATTG   <--- my output for 2nd example

You see I am missing two letters, and I need some placeholders to keep the input's letter positions the same as in the reference, so I want to fill in two X's:

    CGAATTAAXXGGAATTG   <--- same length, other letter positions will be the same

I also posted the link to my original post, which had a different problem (already fixed); there you can find my input files.

Thank you for your help
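The length bookkeeping behind the two examples above can be sketched as follows. This is a minimal illustration, not code from the thread: the helper names `parse_cigar` and `cigar_lengths` are hypothetical, and only the M, I and D operations mentioned in the post are handled. It relies on the usual reading of CIGAR operations, where M consumes a letter in both strings, I only in the input, and D only in the reference.

```perl
use strict;
use warnings;

# Hypothetical helper: split a CIGAR string such as "8M2I7M"
# into [length, operation] pairs. Only M, I and D are recognised.
sub parse_cigar {
    my ($cigar) = @_;
    my @ops;
    while ($cigar =~ /(\d+)([MID])/g) {
        push @ops, [$1, $2];
    }
    return @ops;
}

# Hypothetical helper: count how many letters the CIGAR consumes
# in the input and in the reference.
# M counts for both, I for the input only, D for the reference only.
sub cigar_lengths {
    my ($cigar) = @_;
    my ($input_len, $ref_len) = (0, 0);
    for my $op (parse_cigar($cigar)) {
        my ($len, $type) = @$op;
        $input_len += $len if $type eq 'M' || $type eq 'I';
        $ref_len   += $len if $type eq 'M' || $type eq 'D';
    }
    return ($input_len, $ref_len);
}

for my $cigar (qw(8M2I7M 8M2D7M)) {
    my ($input_len, $ref_len) = cigar_lengths($cigar);
    print "$cigar: input letters = $input_len, reference letters = $ref_len\n";
}
# prints:
# 8M2I7M: input letters = 17, reference letters = 15
# 8M2D7M: input letters = 15, reference letters = 17
```

The counts agree with the strings in the post: the 17-letter input shrinks to 15 letters once the insertion is dropped, and the 15-letter input grows to 17 once the deletion is padded with X's.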
Re^3: string manipulation with Regex by Marshall (Canon) on Sep 01, 2010 at 22:48 UTC
Update: I missed the part about multiple I and D sections, so I adjusted the loop to handle that. I now also see that there was some CountZero code prior to the thread level I've replied to. His code looks fine to me. What I did is very similar, except that I used substr() instead of print.

I think this does what you want. Basically, an insertion in the CIGAR becomes a deletion and vice versa, so I use the edit instructions in the CIGAR in an inverse sense. The total of the field lengths in the CIGAR (viewed in the inverted sense) may be less than the number of characters in the input, so I think this means the output should be truncated to that total. Whether some final adjustment is needed after applying the inverse of all the editing commands, either truncating or perhaps adding more "X"'s, is unclear to me. It is just a matter of knowing what is required, and that's why I kept a running tally of the total length.

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>)
    {
        next if /^\s*$/;   #skip blank lines
        my ($input, $CIGAR) = split;
        my $ref = $input;  #working copy of $input

        my (@edit_cmd) = $CIGAR =~ m/\d+\w/g;
        my $curr_pos = 0;
        my $total_len = 0;
        foreach my $cmd (@edit_cmd)
        {
            if (my ($M) = $cmd =~ m/(\d+)M/)
            {
                $curr_pos += $M;
                $total_len += $M;
            }
            elsif (my ($I) = $CIGAR =~ m/(\d+)I/)
            {
                substr($ref,$curr_pos,$I,'');       #delete $I characters
                $total_len -= $I;
            }
            elsif (my ($D) = $CIGAR =~ m/(\d+)D/)
            {
                substr($ref,$curr_pos,0,"X" x $D);  #insert $D X's
                $total_len += $D;
                $curr_pos += $D;
            }
        }
        $ref = substr($ref,0,$total_len);  #truncate ?????
        print "INPUT = $input CIGAR = $CIGAR\n";
        print "REF   = $ref\n\n";
    }

    =prints
    INPUT = CGAATTAATGGGAATTG CIGAR = 8M2I7M
    REF   = CGAATTAAGGAAT

    INPUT = CGAATTAATGGGAATTG CIGAR = 2M2I2M3D10M
    REF   = CGTTTGGGAA

    INPUT = CGAATTAATGGGA CIGAR = 8M2D7M
    REF   = CGAATTAAXXTGGGA
    =cut

    __DATA__
    CGAATTAATGGGAATTG 8M2I7M
    CGAATTAATGGGAATTG 2M2I2M3D10M
    CGAATTAATGGGA 8M2D7M
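For readers unfamiliar with the four-argument form of substr() that the post relies on, here is a small isolated sketch. The strings are taken from the examples earlier in the thread; the variable names are only illustrative. The fourth argument replaces the selected span in place, so an empty replacement deletes characters and a zero-length span inserts them.

```perl
use strict;
use warnings;

# Delete: replace the 2 characters at offset 8 with the empty string.
my $deleted = 'CGAATTAATGGGAATTG';
substr($deleted, 8, 2, '');
print "$deleted\n";     # CGAATTAAGGAATTG

# Insert: replace a zero-length span at offset 8 with two X's,
# which shifts the rest of the string to the right.
my $inserted = 'CGAATTAAGGAATTG';
substr($inserted, 8, 0, 'X' x 2);
print "$inserted\n";    # CGAATTAAXXGGAATTG
```

Because the string is modified in place, any position counter kept alongside it has to be advanced after an insert but not after a delete, which is what the `$curr_pos` handling in the post does.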
Re^4: string manipulation with Regex by FluffyBunny (Acolyte) on Sep 02, 2010 at 15:24 UTC
Thank you very much for your assistance! It makes sense to me. Now I have to work on combinations of multiple D's and I's. You are all awesome Perlmonks! =D
Re^5: string manipulation with Regex by Marshall (Canon) on Sep 02, 2010 at 15:48 UTC
Re^4: string manipulation with Regex by bluecompassrose (Initiate) on Sep 02, 2010 at 16:06 UTC
That almost worked for my own tests. I switched $CIGAR to $cmd in the elsif statements, and it started working no matter what combination of I's, D's, or M's I gave it. Thanks for your help! =)
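A minimal sketch of the change described above, wrapped in a sub so it can be called directly. The name `apply_cigar` is hypothetical. The loop follows Marshall's post, except that each elsif matches the current `$cmd` rather than the whole `$CIGAR`. As an assumption of this sketch, the final truncation step (marked as uncertain in the original post) is left out, and the CIGAR is assumed not to run past the end of the input.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a CIGAR string to $input in the inverse sense.
# An I section removes letters from the input; a D section pads with X's.
sub apply_cigar {
    my ($input, $cigar) = @_;
    my $ref      = $input;                 # working copy of the input
    my @edit_cmd = $cigar =~ m/\d+\w/g;    # e.g. ("8M", "2I", "7M")
    my $curr_pos = 0;

    foreach my $cmd (@edit_cmd) {
        if (my ($M) = $cmd =~ m/(\d+)M/) {
            $curr_pos += $M;                        # match: step over $M letters
        }
        elsif (my ($I) = $cmd =~ m/(\d+)I/) {
            substr($ref, $curr_pos, $I, '');        # insertion: delete $I letters
        }
        elsif (my ($D) = $cmd =~ m/(\d+)D/) {
            substr($ref, $curr_pos, 0, 'X' x $D);   # deletion: insert $D X's
            $curr_pos += $D;
        }
    }
    return $ref;    # no truncation applied (assumption of this sketch)
}

print apply_cigar('CGAATTAATGGGAATTG', '8M2I7M'), "\n";   # CGAATTAAGGAATTG
print apply_cigar('CGAATTAAGGAATTG',   '8M2D7M'), "\n";   # CGAATTAAXXGGAATTG
```

Matching on `$cmd` matters as soon as a CIGAR mixes I and D sections: a test against the whole `$CIGAR` finds the first I anywhere in the string, so a D command would be treated as a deletion of letters instead of an insertion of X's.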
Re^3: string manipulation with Regex by CountZero (Bishop) on Sep 01, 2010 at 21:45 UTC
Did you actually try my program with the sample inputs you mention above? If so, you will have seen that the output is exactly as you expect! So what is still your problem?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^4: string manipulation with Regex by FluffyBunny (Acolyte) on Sep 02, 2010 at 15:19 UTC
I just wanted to explain more clearly. The code itself is working. Thank you for your help =)