FluffyBunny has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl monks,

A noob bioinfomatics student (yes, me) needs your knowledge! I am trying to manipulate string (no space between) with string+integer input(also no space between)using regular expression.

From an alignment program, I get mix of string+integer (called CIGAR ID) and this is after comparing an input string to a reference string. For example, you get "60M2D142M" and this means "60 match, 2 deletion, 142 match" in my input string compared to a reference string (direction: from left to right). Another example is "12M1I192M" and this means "12 match 1 insertion 192 match" in my input string. The length of this string matches to the reference when the integers are added together for deletion (for example, 60+2+142=204 for first example) and total not including number of insertion for insertion case (12+192=204 for 2nd example).

I know parsing this by number and letter (M,D,I) is easy.

split /[A-Z]/

So my first example after parsing will be an array of indexes "60 M 2 D 142 M" However, I need to use all these info so I can make sure to skip 60 letters and add two Xs to make up the deletion of two letters and skip 142 (this will make the same length as the reference string. For my second example, I will get "12 M 1 I 192 M" and I need to get rid of this insertion after 12 letters and skip the rest of letters to make the same length as the reference.

My code is already getting complicated as I have to worry about this CIGAR ID that comes with multiple "I"s and "D"s..

this is a part of my code:

my %bwa = (); my @splitstore=(); my @s1 = (); my @s2 = (); my @s3 = (); my @s4 = (); my @s5 = (); my $ssl=(0); my $splitstoreD=(0); my $splitstoreI=(0); my $file1 = shift; open (FILE1, "$file1") || die "Failed to open $file1 for reading : $!" +; # Open second file while (<FILE1>) { # Reading second hash if ($_ =~ /^[^@]/s) { chomp; my @line = split /\s+/, $_; my $ID; if ($line[2] =~ /[^*]/) { $ID = $line[0]; push @{$bwa{$ID}}, @line[0,2,5,12,15,16,17,18,3]; #$bwa{$ID}[0] = $line[0]; # seq ID #$bwa{$ID}[1] = $line[2]; # Ref ID #$bwa{$ID}[2] = $line[5]; # CIGAR ID for insertion #$bwa{$ID}[3] = @line[9]; # Processed seq (already C->T) #$bwa{$ID}[3] = $line[12]; # Edit distance (edited area by + # of base) : NM #$bwa{$ID}[4] = $line[15]; # No. of mismatches in the alig +nment : XM #$bwa{$ID}[5] = $line[16]; # No. of gap opens for insertio +n : XO #$bwa{$ID}[6] = $line[17]; # No. of gap extensions for del +etion :XG #$bwa{$ID}[7] = $line[18]; # Mismatching positions / bases + : MD #$bwa{$ID}[8] = $line[3]; # Starting position # These are not from my input, I made these so I can save +some data for parsing the CIGAR ID. #$bwa{$ID}[10] = Total Number of Bases #$bwa{$ID}[11] = Deletion (0 if no) #$bwa{$ID}[12] = Insertion (0 if no) #$bwa{$ID}[13] = Number of bases until deletion #$bwa{$ID}[14] = Number of bases until insertion # Modifying CIGAR ID's to keep track of matches, deletions +, insertions @splitstore = split /[A-Z]/, $bwa{$ID}[2]; if ($bwa{$ID}[2] =~ /[D]/) { $splitstoreD = 1; } else { $s +plitstoreD = 0 } if ($bwa{$ID}[2] =~ /[I]/) { $splitstoreI = 1; } else { $s +plitstoreI = 0 } #Storing Total Bases $ssl = scalar @splitstore; if($ssl>1){ if ($ssl>4){ #If there is both a deletion and insertio +n... $bwa{$ID}[10] = $splitstore[0]+$splitstore[2]+$spl +itstore[$ssl-1]; #add first, third and last nubmers to get total base +s } else{ $bwa{$ID}[10] = $splitstore[0]+$splitstore[$ssl-1];} # +add first and last numbers to get total bases } else { $bwa{$ID}[10] = $splitstore[0]; } #just the first n +umber #Defining Deletions if($splitstoreD == 1) { $bwa{$ID}[11]="Del" } else { $bwa{ +$ID}[11]=0 } #Defining Insertions if($splitstoreI == 1) { $bwa{$ID}[12]="Ins" } else { $bwa{ +$ID}[12]=0 } #Defining Position of Deletion if ($bwa{$ID}[2] =~ /(\d*(?=D))/) { @s1 = split /(\d)/, $bwa{$ID}[2]; @s2 = split /(\d+M)/, @s1; @s3 = split /(\d+M)/, @s2; @s4 = split /(\d+I)/, @s3; @s5 = split /[D]/, @s4; $bwa{$ID}[13]=@s5; } print "@splitstore\t"; print "$splitstoreI\t"; print "$splitstoreD\t"; print "$bwa{$ID}[10]\n"; print "@s1\t"; print "@s2\t"; print "@s3\t"; print "@s4\t"; print "@s5\t"; print "$bwa{$ID}[13]\n"; @s1 = (); @s2 = (); @s3 = (); @s4 = (); @s5 = (); } } } close FILE1 || die "Failed to close $file1 : $!";

This is not completed yet. I am overwhelmed by multiple Is and Ds case...Is there a better way to do? Please help.

Replies are listed 'Best First'.
Re: string manipulation with Regex
by jwkrahn (Abbot) on Aug 31, 2010 at 23:40 UTC
    chomp; my @line = split /\s+/, $_;

    That is usually written as:

    my @line = split;

    if ($bwa{$ID}[2] =~ /[D]/) { $splitstoreD = 1; } else { $s +plitstoreD = 0 } if ($bwa{$ID}[2] =~ /[I]/) { $splitstoreI = 1; } else { $s +plitstoreI = 0 }

    That is usually written as:

    $splitstoreD = $bwa{$ID}[2] =~ /D/ ? 1 : 0; $splitstoreI = $bwa{$ID}[2] =~ /I/ ? 1 : 0;

    Or simply:

    $splitstoreD = $bwa{$ID}[2] =~ /D/; $splitstoreI = $bwa{$ID}[2] =~ /I/;

    #Defining Deletions if($splitstoreD == 1) { $bwa{$ID}[11]="Del" } else { $bwa{ +$ID}[11]=0 } #Defining Insertions if($splitstoreI == 1) { $bwa{$ID}[12]="Ins" } else { $bwa{ +$ID}[12]=0 }

    That is usually written as:

    #Defining Deletions $bwa{$ID}[11] = $splitstoreD == 1 ? 'Del' :0; #Defining Insertions $bwa{$ID}[12] = $splitstoreI == 1 ? 'Ins' : 0;

    @s2 = split /(\d+M)/, @s1; @s3 = split /(\d+M)/, @s2; @s4 = split /(\d+I)/, @s3; @s5 = split /[D]/, @s4;

    You are using an array in scalar context which means that if @s1 has 53 elements, for example, then @s2 = split /(\d+M)/, @s1 is the same as @s2 = split /(\d+M)/, '53'.    The second argument to split has to be a scalar or it will be evaluated as a scalar.


      I got some help from the way you coded... thank you, but I am still in trouble!
Re: string manipulation with Regex
by stevieb (Canon) on Aug 31, 2010 at 22:36 UTC

    Can you post 15-20 lines of the input file as an example? I can see at least a few things that I can fiddle around with to make things easier on the eyes while I'm waiting other work to complete...

    Steve

      For input files, please see

      http://www.perlmonks.org/?node_id=855348
Re: string manipulation with Regex
by perlpie (Beadle) on Sep 01, 2010 at 02:07 UTC

    I cannot tell from your posting what your inputs are and what your goal is.

    It sort of sounds like you're trying to generate the reference string given only the input string and CIGAR ID... or maybe you're just trying to "fill in" the deletions (but not remove the insertions)?

    It would help to have a very clear example of what you have to work with and what you're trying to generate. Even if you have to do the first one by hand and explain how you did it, that would help me understand your goal.

    Often when some code seem hard to write, thinking about how to do it with physical objects helps. For example, you can derive all sorts of sort algorithms by thinking about how you would sort index cards by hand given certain sets of constraints.

      Thank you, perlpie

      Like you mentioned, I am trying to generate the reference string given only the input string and CIGAR ID. I need to fill in with X's with deletion, and I need to get rid of extra letters in the string with inserted letters to make them as similar as reference.. (basically the letters' locations need to be conserved by filling letter X or removing excessive letters that do not match) I hope this makes much more clear now

Re: string manipulation with Regex
by CountZero (Bishop) on Sep 01, 2010 at 12:41 UTC
    I think this will do what you want, but I haven't been able to fully test it, for lack of a few examples, so I made up the input string:
    use strict; use warnings; use 5.012; my $command = '8M5I4D5M'; my $input = 'M234567MI234IM234M'; while ($command =~ m/(\d+)([MID])/g) { my $value = $1; my $code = $2; given($code) { when ('D') { print 'X' x $value; } when ('I') { $input = substr($input, $value); } when ('M') { print substr($input, 0, $value); $input = substr($input, $value); } } }
    output:
    M234567MXXXXM234M

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Dear CountZero and other senior Perlmonks who replied to my post: This is basically what I am trying to do:

      you have a text file with string like this

      CGAATTAATGGGAATTG

      and you have your reference sequence saying

      CGAATTAAGGAATTG

      note your input has two letters inserted, which are TG

      So you will get CIGAR ID from your alignment program saying

      8M2I7M

      To help you understand visually (so originally no blank), I manually aligned these two string variables.

      CGAATTAATGGGAATTG

      CGAATTAA GGAATTG

      So using this I would like to keep the same original letter position for inputs... that is why I have to compare input string variables to reference string variables to make them have the same letter position (basically insertions are useless)

      For deletions, I have the exact opposite situation.. so lets flip my first example's situation This case it will be 8M2D7M

      CGAATTAA GGAATTG <--- my input for 2nd example (the gap is again intentionally made)

      CGAATTAATGGGAATTG <--- my output for 2nd example

      you see I am missing two letters and I need some letter holders to keep the input's letter positions the same compared to the reference.. so I want to fill two X's

      CGAATTAAXXGGAATTG <--- same length, other letter positions will be the same

      I also posted the link which leads to the original post that I had with a different problem (already fixed) there you can find my input files. Thank you for your help

        Update: missed the part about multiple I,D sections....so adjusted loop to do that. And now I see that there was some Count Zero code prior to the thread level I've replied to. His code looks fine to me. What I did is very similar except that I used substr() instead of print.

        I think this does what you want. Basically in the CIGAR, an insertion becomes a deletion and vice-versa. So I use the edit instructions in the CIGAR in an inverse sense.

        The total field lengths in the CIGAR (viewed in inverted sense) may be less than the number of characters in the input, so I think this means truncate the output to whatever that total is.

        whether or not some final adjustment to either truncate or perhaps add more "X"'s after inverse of all editing commands is unclear to me - just a matter of knowing what is required - that's why I kept a running tally of the total length.

        #!/usr/bin/perl -w use strict; while (<DATA>) { next if /^\s*$/; #skip blank lines my ($input, $CIGAR) = split; my $ref = $input; #working copy of $input my (@edit_cmd) = $CIGAR =~ m/\d+\w/g; my $curr_pos = 0; my $total_len =0; foreach my $cmd (@edit_cmd) { if (my ($M) = $cmd =~ m/(\d+)M/) { $curr_pos += $M; $total_len+= $M; } elsif (my ($I) = $CIGAR =~ m/(\d+)I/) { substr($ref,$curr_pos,$I,''); #delete $I characters $total_len -= $I } elsif (my ($D) = $CIGAR =~ m/(\d+)D/) { substr($ref,$curr_pos,0,"X" x $D); #insert $D X's $total_len += $D; $curr_pos += $D; } } $ref = substr($ref,0,$total_len); #truncate ????? print "INPUT = $input CIGAR = $CIGAR\n"; print "REF = $ref\n\n"; } =prints INPUT = CGAATTAATGGGAATTG CIGAR = 8M2I7M REF = CGAATTAAGGAAT INPUT = CGAATTAATGGGAATTG CIGAR = 2M2I2M3D10M REF = CGTTTGGGAA INPUT = CGAATTAATGGGA CIGAR = 8M2D7M REF = CGAATTAAXXTGGGA =cut __DATA__ CGAATTAATGGGAATTG 8M2I7M CGAATTAATGGGAATTG 2M2I2M3D10M CGAATTAATGGGA 8M2D7M
        Did you actually try my program with the sample inputs you mention above?

        If so, you will have seen that the output is exactly as you expect! So what is still your problem?

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: string manipulation with Regex
by DrHyde (Prior) on Sep 01, 2010 at 07:40 UTC
    Have you looked to see if the code's already been written for you in BioPerl?

      I did.. and no, my script is very specific for certain input files... I think I am very slowly getting there... very slowly. :'(

Re: string manipulation with Regex
by bluecompassrose (Initiate) on Sep 01, 2010 at 19:51 UTC

    I've been looking at the code for quite a while, and I've found a (messy) solution to part 1 of 2 of the problem.

    Part 1: Make a way to access the total number of bases, number of insertions, deletions, and matches. Make a way to determine where the insertions and matches exist. That has been the toughest, literally needing me to play around with extra hashes.

    Part 2: To insert X's in deleted segments of the sequence. This is a bit easier now that the location and length of the deletions is known, but I dont know on how to "insert" characters into the middle of a string (is there a "string tokenizer" style of code for perl like there is in java?)

    That said, here's what I figured out after about 12 hours of sitting and coding (I'm not good at it D:)

    use warnings; use strict; use Data::Dumper; ## CAUTION: # of sequences in BWA file and # of sequences in input seq +uencing data should be equal!!## # BWA alignment output (.sam) my %bwa = (); my @splitstore=(); my @splitstoreMatch=(); my @splitstoreDel=(); my @splitstoreIns=(); my @splitcheck=(); my %delxstore=(); my %delystore=(); my %insxstore=(); my %insystore=(); my $ssl=(0); my $file1 = shift; open (FILE1, "$file1") || die "Failed to open $file1 for reading : $!" +; # Open second file while (<FILE1>) { # Reading second hash if ($_ =~ /^[^@]/s) { chomp; my @line = split /\s+/, $_; my $ID; if ($line[2] =~ /[^*]/) { $ID = $line[0]; push @{$bwa{$ID}}, @line[0,2,5,12,15,16,17,18,3]; #$bwa{$ID}[0] = $line[0]; # seq ID #$bwa{$ID}[1] = $line[2]; # Ref ID #$bwa{$ID}[2] = $line[5]; # CIGAR ID for insertion #$bwa{$ID}[3] = @line[9]; # Processed seq (already C->T) #$bwa{$ID}[3] = $line[12]; # Edit distance (edited area by # +of base) : NM #$bwa{$ID}[4] = $line[15]; # No. of mismatches in the alignme +nt : XM #$bwa{$ID}[5] = $line[16]; # No. of gap opens for insertion : + XO #$bwa{$ID}[6] = $line[17]; # No. of gap extensions for deleti +on :XG #$bwa{$ID}[7] = $line[18]; # Mismatching positions / bases : +MD #$bwa{$ID}[8] = $line[3]; # Starting position #$bwa{$ID}[10] = Total Number of Bases #$bwa{$ID}[11] = Deletions (0 if no) #$bwa{$ID}[12] = Insertions (0 if no) #$insstore{$ID}[X] = X,Y of bases to match, then length of de +letion #$delstore{$ID}[X] = X,Y of bases until insertion, then lengt +h of insertion # Modifying CIGAR ID's to keep track of matches, deletions, i +nsertions @splitstore = split /[A-Z]/, $bwa{$ID}[2]; #just the values $ssl = scalar @splitstore; @splitstoreDel = split /[^A-Z]+[M,I](?![D])/, $bwa{$ID}[2]; c +hop(@splitstoreDel); #just the deletions @splitstoreIns = split /[^A-Z]+[M,D](?![I])/, $bwa{$ID}[2]; c +hop(@splitstoreIns); #just the insertions @splitstoreMatch = split /[^A-Z]+[D,I](?![M])/, $bwa{$ID}[2]; + chop(@splitstoreMatch); #just the matches $splitcheck[0] = $bwa{$ID}[2]; #Internal storage of insertion and deletions foreach my $i (@splitstoreDel){ unless($i eq ""){ $bwa{$ID}[11] +=1;} #define number of insertions } foreach my $i (@splitstoreIns){ unless($i eq ""){ $bwa{$ID}[12] +=1;} #define number of insertions } #Storing Total Bases foreach my $i (@splitstoreIns){ #for each insertion unless($i eq ""){ $bwa{$ID}[10] +=$i;} } foreach my $i (@splitstoreMatch){ #for each match unless($i eq ""){ $bwa{$ID}[10] +=$i;} } #Define positions of Deletions and insertions my $tempM = 0; for(my $i = 0;$i<$ssl;$i++){ if ($splitcheck[0] =~ /^[^A-Z]+[M]/) { #if first sectio +n match @splitcheck = split /^[^A-Z]+[M]/, $splitcheck[0]; +#eat first section shift(@splitcheck); #get r +id of "" $tempM = $splitstore[$i] + $tempM; #shift number of + matched bases } elsif ($splitcheck[0] =~ /^[^A-Z]+[D]/){ #if first sect +ion deletion @splitcheck = split /^[^A-Z]+[D]/, $splitcheck[0]; +#eat first section shift(@splitcheck); + #get rid of "" push @{$delxstore{$ID}}, $tempM; ##Note: need to s +tore: ID, X position (number of bases before deletion), Y position (n +umber of bases deleted) push @{$delystore{$ID}}, $splitstore[$i]; $tempM = $splitstore[$i] + $tempM; #shift number +of matched bases } elsif ($splitcheck[0] =~ /^[^A-Z]+[I]/){ #if first sect +ion insertion @splitcheck = split /^[^A-Z]+[I]/, $splitcheck[0]; +#eat first section shift(@splitcheck); + #get rid of "" push @{$insxstore{$ID}}, $tempM; push @{$insystore{$ID}}, $splitstore[$i]; ##Note: +need to store: ID, X position (number of bases before insertion), Y p +osition (number of bases ins) $tempM = $splitstore[$i] + $tempM; #shift number +of matched bases } } ## Bug Checking #print "\n\t@splitstore\t\t"; #print "@splitstoreDel\t\t"; #print "@splitstoreIns\t\t"; #print "$bwa{$ID}[10]\t\t\t"; #if (exists $delxstore{$ID}[1]){print "$delxstore{$ID}[1]\ +t";} #if (exists $delystore{$ID}[1]){print "$delystore{$ID}[1]\ +n";} #if (exists $insxstore{$ID}[1]){print "$insxstore{$ID}[1]\ +t";} #if (exists $insystore{$ID}[1]){print "$insystore{$ID}[1]\ +n";} #print "$insstore{$ID}\n"; } } } close FILE1 || die "Failed to close $file1 : $!";

    Here's a bit of input

    @SQ SN:TMEM200B LN:293 @SQ SN:B3GAT2-2_P001 LN:204 Seq1Perfect 0 B3GAT2-2_P001 1 37 500M400D300M200I100M + * 0 0 GGTTGGTTTTTATTTTTTGGAAGAGTTTTAGATTATAGGTGTTGTCGTCGTT +AGCGAAGAAGAGTACGTCGGGTTGCGCGCGTTGGTGTTGGTGTTTTTGGCGTAGTTAGGCGAGGTTCGC +GTTGCGTTGTTTAGTGGCGCGCGGTAGTTCGGGTCGTTTGTAGCGTCGCGGCGTGGGTACGTGTAGGTG +AGTGTTGGGTAGTT &a==aa=====a======aaaaaaa====aaa==a=aaa=a==a=$a=$a= +=aa$aaaaaaaaa=a$a=$aaa==a$a$a$a==aa=a==aa=a=====aa$a=aa==aaa$aaaa==$a +$a==a$a==a===aa=aa$a$a$aa=aa==$aaa=$a===a=aa$a=$a$aa$a=aaa=a$a=a=aaa= +aaa=a==aaa=aa==

    The huge XXX M number is inflated for the purposes of bug testing (you can replace it with something more sensible, say 55M6D143M), since there may or may not be one or more D's, I's and M's in the CIGAR ID. Note, there won't be two letters in a row (which is good, because my code fails there). The thing I did was get the various information out of this little string into variables that could be used to later insert X's into the Deletions: the XXX D numbers.

    All that said my code is a mess. I dont know how to optimize or how to insert characters into a string, midstring. I can't help much more on my own here, so =(

      Thanks bluecompassrose for your help... (also thank you for posting the bwa file input for me - you wrote 'output' instead) but I am not sure if it is efficient for millions of sequences (basically >2GB of text files).. Any efficient ideas? please let me know.

        Haha, yup. I'm terrible at perl. Java programmer at heart =P. That said, take a look at Marshall's code and my comment above, it should work for what you want it to do o.ob