in reply to string manipulation with Regex

I've been looking at the code for quite a while, and I've found a (messy) solution to part 1 of 2 of the problem.

Part 1: Make a way to access the total number of bases, number of insertions, deletions, and matches. Make a way to determine where the insertions and matches exist. That has been the toughest, literally needing me to play around with extra hashes.

Part 2: To insert X's in deleted segments of the sequence. This is a bit easier now that the location and length of the deletions is known, but I dont know on how to "insert" characters into the middle of a string (is there a "string tokenizer" style of code for perl like there is in java?)

That said, here's what I figured out after about 12 hours of sitting and coding (I'm not good at it D:)

use warnings; use strict; use Data::Dumper; ## CAUTION: # of sequences in BWA file and # of sequences in input seq +uencing data should be equal!!## # BWA alignment output (.sam) my %bwa = (); my @splitstore=(); my @splitstoreMatch=(); my @splitstoreDel=(); my @splitstoreIns=(); my @splitcheck=(); my %delxstore=(); my %delystore=(); my %insxstore=(); my %insystore=(); my $ssl=(0); my $file1 = shift; open (FILE1, "$file1") || die "Failed to open $file1 for reading : $!" +; # Open second file while (<FILE1>) { # Reading second hash if ($_ =~ /^[^@]/s) { chomp; my @line = split /\s+/, $_; my $ID; if ($line[2] =~ /[^*]/) { $ID = $line[0]; push @{$bwa{$ID}}, @line[0,2,5,12,15,16,17,18,3]; #$bwa{$ID}[0] = $line[0]; # seq ID #$bwa{$ID}[1] = $line[2]; # Ref ID #$bwa{$ID}[2] = $line[5]; # CIGAR ID for insertion #$bwa{$ID}[3] = @line[9]; # Processed seq (already C->T) #$bwa{$ID}[3] = $line[12]; # Edit distance (edited area by # +of base) : NM #$bwa{$ID}[4] = $line[15]; # No. of mismatches in the alignme +nt : XM #$bwa{$ID}[5] = $line[16]; # No. of gap opens for insertion : + XO #$bwa{$ID}[6] = $line[17]; # No. of gap extensions for deleti +on :XG #$bwa{$ID}[7] = $line[18]; # Mismatching positions / bases : +MD #$bwa{$ID}[8] = $line[3]; # Starting position #$bwa{$ID}[10] = Total Number of Bases #$bwa{$ID}[11] = Deletions (0 if no) #$bwa{$ID}[12] = Insertions (0 if no) #$insstore{$ID}[X] = X,Y of bases to match, then length of de +letion #$delstore{$ID}[X] = X,Y of bases until insertion, then lengt +h of insertion # Modifying CIGAR ID's to keep track of matches, deletions, i +nsertions @splitstore = split /[A-Z]/, $bwa{$ID}[2]; #just the values $ssl = scalar @splitstore; @splitstoreDel = split /[^A-Z]+[M,I](?![D])/, $bwa{$ID}[2]; c +hop(@splitstoreDel); #just the deletions @splitstoreIns = split /[^A-Z]+[M,D](?![I])/, $bwa{$ID}[2]; c +hop(@splitstoreIns); #just the insertions @splitstoreMatch = split /[^A-Z]+[D,I](?![M])/, $bwa{$ID}[2]; + chop(@splitstoreMatch); #just the matches $splitcheck[0] = $bwa{$ID}[2]; #Internal storage of insertion and deletions foreach my $i (@splitstoreDel){ unless($i eq ""){ $bwa{$ID}[11] +=1;} #define number of insertions } foreach my $i (@splitstoreIns){ unless($i eq ""){ $bwa{$ID}[12] +=1;} #define number of insertions } #Storing Total Bases foreach my $i (@splitstoreIns){ #for each insertion unless($i eq ""){ $bwa{$ID}[10] +=$i;} } foreach my $i (@splitstoreMatch){ #for each match unless($i eq ""){ $bwa{$ID}[10] +=$i;} } #Define positions of Deletions and insertions my $tempM = 0; for(my $i = 0;$i<$ssl;$i++){ if ($splitcheck[0] =~ /^[^A-Z]+[M]/) { #if first sectio +n match @splitcheck = split /^[^A-Z]+[M]/, $splitcheck[0]; +#eat first section shift(@splitcheck); #get r +id of "" $tempM = $splitstore[$i] + $tempM; #shift number of + matched bases } elsif ($splitcheck[0] =~ /^[^A-Z]+[D]/){ #if first sect +ion deletion @splitcheck = split /^[^A-Z]+[D]/, $splitcheck[0]; +#eat first section shift(@splitcheck); + #get rid of "" push @{$delxstore{$ID}}, $tempM; ##Note: need to s +tore: ID, X position (number of bases before deletion), Y position (n +umber of bases deleted) push @{$delystore{$ID}}, $splitstore[$i]; $tempM = $splitstore[$i] + $tempM; #shift number +of matched bases } elsif ($splitcheck[0] =~ /^[^A-Z]+[I]/){ #if first sect +ion insertion @splitcheck = split /^[^A-Z]+[I]/, $splitcheck[0]; +#eat first section shift(@splitcheck); + #get rid of "" push @{$insxstore{$ID}}, $tempM; push @{$insystore{$ID}}, $splitstore[$i]; ##Note: +need to store: ID, X position (number of bases before insertion), Y p +osition (number of bases ins) $tempM = $splitstore[$i] + $tempM; #shift number +of matched bases } } ## Bug Checking #print "\n\t@splitstore\t\t"; #print "@splitstoreDel\t\t"; #print "@splitstoreIns\t\t"; #print "$bwa{$ID}[10]\t\t\t"; #if (exists $delxstore{$ID}[1]){print "$delxstore{$ID}[1]\ +t";} #if (exists $delystore{$ID}[1]){print "$delystore{$ID}[1]\ +n";} #if (exists $insxstore{$ID}[1]){print "$insxstore{$ID}[1]\ +t";} #if (exists $insystore{$ID}[1]){print "$insystore{$ID}[1]\ +n";} #print "$insstore{$ID}\n"; } } } close FILE1 || die "Failed to close $file1 : $!";

Here's a bit of input

@SQ SN:TMEM200B LN:293 @SQ SN:B3GAT2-2_P001 LN:204 Seq1Perfect 0 B3GAT2-2_P001 1 37 500M400D300M200I100M + * 0 0 GGTTGGTTTTTATTTTTTGGAAGAGTTTTAGATTATAGGTGTTGTCGTCGTT +AGCGAAGAAGAGTACGTCGGGTTGCGCGCGTTGGTGTTGGTGTTTTTGGCGTAGTTAGGCGAGGTTCGC +GTTGCGTTGTTTAGTGGCGCGCGGTAGTTCGGGTCGTTTGTAGCGTCGCGGCGTGGGTACGTGTAGGTG +AGTGTTGGGTAGTT &a==aa=====a======aaaaaaa====aaa==a=aaa=a==a=$a=$a= +=aa$aaaaaaaaa=a$a=$aaa==a$a$a$a==aa=a==aa=a=====aa$a=aa==aaa$aaaa==$a +$a==a$a==a===aa=aa$a$a$aa=aa==$aaa=$a===a=aa$a=$a$aa$a=aaa=a$a=a=aaa= +aaa=a==aaa=aa==

The huge XXX M number is inflated for the purposes of bug testing (you can replace it with something more sensible, say 55M6D143M), since there may or may not be one or more D's, I's and M's in the CIGAR ID. Note, there won't be two letters in a row (which is good, because my code fails there). The thing I did was get the various information out of this little string into variables that could be used to later insert X's into the Deletions: the XXX D numbers.

All that said my code is a mess. I dont know how to optimize or how to insert characters into a string, midstring. I can't help much more on my own here, so =(

Replies are listed 'Best First'.
Re^2: string manipulation with Regex
by FluffyBunny (Acolyte) on Sep 01, 2010 at 20:01 UTC

    Thanks bluecompassrose for your help... (also thank you for posting the bwa file input for me - you wrote 'output' instead) but I am not sure if it is efficient for millions of sequences (basically >2GB of text files).. Any efficient ideas? please let me know.

      Haha, yup. I'm terrible at perl. Java programmer at heart =P. That said, take a look at Marshall's code and my comment above, it should work for what you want it to do o.ob