Re: string manipulation with Regex

I've been looking at the code for quite a while, and I've found a (messy) solution to part 1 of 2 of the problem.

Part 1: Make a way to access the total number of bases, number of insertions, deletions, and matches. Make a way to determine where the insertions and matches exist. That has been the toughest, literally needing me to play around with extra hashes.

Part 2: To insert X's in deleted segments of the sequence. This is a bit easier now that the location and length of the deletions is known, but I dont know on how to "insert" characters into the middle of a string (is there a "string tokenizer" style of code for perl like there is in java?)

That said, here's what I figured out after about 12 hours of sitting and coding (I'm not good at it D:)

use warnings;
use strict;
use Data::Dumper;

## CAUTION: # of sequences in BWA file and # of sequences in input seq
+uencing data should be equal!!##

# BWA alignment output (.sam)

my %bwa = ();

my @splitstore=();
my @splitstoreMatch=();
my @splitstoreDel=();
my @splitstoreIns=();
my @splitcheck=();
my %delxstore=();
my %delystore=();
my %insxstore=();
my %insystore=();
my $ssl=(0);

my $file1 = shift;
open (FILE1, "$file1") || die "Failed to open $file1 for reading : $!"
+; # Open second file

while (<FILE1>) { # Reading second hash
  if ($_ =~ /^[^@]/s) {
     chomp;
     my @line = split /\s+/, $_;
     my $ID;
     if ($line[2] =~ /[^*]/) {
         $ID = $line[0];
         push @{$bwa{$ID}}, @line[0,2,5,12,15,16,17,18,3];

         #$bwa{$ID}[0] = $line[0];  # seq ID
         #$bwa{$ID}[1] = $line[2];  # Ref ID
         #$bwa{$ID}[2] = $line[5];  # CIGAR ID for insertion
         #$bwa{$ID}[3] = @line[9];  # Processed seq (already C->T)
         #$bwa{$ID}[3] = $line[12]; # Edit distance (edited area by # 
+of base) : NM
         #$bwa{$ID}[4] = $line[15]; # No. of mismatches in the alignme
+nt : XM
         #$bwa{$ID}[5] = $line[16]; # No. of gap opens for insertion :
+ XO
         #$bwa{$ID}[6] = $line[17]; # No. of gap extensions for deleti
+on :XG
         #$bwa{$ID}[7] = $line[18]; # Mismatching positions / bases : 
+MD
         #$bwa{$ID}[8] = $line[3];  # Starting position
            
         #$bwa{$ID}[10] = Total Number of Bases
         #$bwa{$ID}[11] = Deletions (0 if no)
         #$bwa{$ID}[12] = Insertions (0 if no)
         #$insstore{$ID}[X] = X,Y of bases to match, then length of de
+letion
         #$delstore{$ID}[X] = X,Y of bases until insertion, then lengt
+h of insertion
            
         # Modifying CIGAR ID's to keep track of matches, deletions, i
+nsertions
         @splitstore = split /[A-Z]/, $bwa{$ID}[2]; #just the values
         $ssl = scalar @splitstore;
         @splitstoreDel = split /[^A-Z]+[M,I](?![D])/, $bwa{$ID}[2]; c
+hop(@splitstoreDel); #just the deletions
         @splitstoreIns = split /[^A-Z]+[M,D](?![I])/, $bwa{$ID}[2]; c
+hop(@splitstoreIns); #just the insertions
         @splitstoreMatch = split /[^A-Z]+[D,I](?![M])/, $bwa{$ID}[2];
+ chop(@splitstoreMatch); #just the matches
         $splitcheck[0] = $bwa{$ID}[2];
                #Internal storage of insertion and deletions
                foreach my $i (@splitstoreDel){
                    unless($i eq ""){
                    $bwa{$ID}[11] +=1;} #define number of insertions
                }
                foreach my $i (@splitstoreIns){
                    unless($i eq ""){
                    $bwa{$ID}[12] +=1;} #define number of insertions
                }
                
                #Storing Total Bases
                foreach my $i (@splitstoreIns){ #for each insertion
                    unless($i eq ""){
                    $bwa{$ID}[10] +=$i;}
                }
                foreach my $i (@splitstoreMatch){ #for each match
                    unless($i eq ""){
                    $bwa{$ID}[10] +=$i;}
                }
                
                #Define positions of Deletions and insertions
               my $tempM = 0;
               for(my $i = 0;$i<$ssl;$i++){
               if ($splitcheck[0] =~ /^[^A-Z]+[M]/) { #if first sectio
+n match
                   @splitcheck = split /^[^A-Z]+[M]/, $splitcheck[0]; 
+#eat first section
                        shift(@splitcheck);                     #get r
+id of ""
                   $tempM = $splitstore[$i] + $tempM; #shift number of
+ matched bases
               }
               elsif ($splitcheck[0] =~ /^[^A-Z]+[D]/){ #if first sect
+ion deletion
                   @splitcheck = split /^[^A-Z]+[D]/, $splitcheck[0]; 
+#eat first section
                        shift(@splitcheck);                           
+  #get rid of ""
                   push @{$delxstore{$ID}}, $tempM;  ##Note: need to s
+tore: ID, X position (number of bases before deletion), Y position (n
+umber of bases deleted)
                   push @{$delystore{$ID}}, $splitstore[$i];
                   $tempM = $splitstore[$i] + $tempM;   #shift number 
+of matched bases
                    }
               elsif ($splitcheck[0] =~ /^[^A-Z]+[I]/){ #if first sect
+ion insertion
                   @splitcheck = split /^[^A-Z]+[I]/, $splitcheck[0]; 
+#eat first section
                        shift(@splitcheck);                           
+  #get rid of ""
                   push @{$insxstore{$ID}}, $tempM;
                   push @{$insystore{$ID}}, $splitstore[$i];  ##Note: 
+need to store: ID, X position (number of bases before insertion), Y p
+osition (number of bases ins)
                   $tempM = $splitstore[$i] + $tempM;   #shift number 
+of matched bases
                    }    
                }    
                
            ## Bug Checking
            #print "\n\t@splitstore\t\t";
            #print "@splitstoreDel\t\t";
            #print "@splitstoreIns\t\t";
            #print "$bwa{$ID}[10]\t\t\t";
            #if (exists $delxstore{$ID}[1]){print "$delxstore{$ID}[1]\
+t";}
            #if (exists $delystore{$ID}[1]){print "$delystore{$ID}[1]\
+n";}
            #if (exists $insxstore{$ID}[1]){print "$insxstore{$ID}[1]\
+t";}
            #if (exists $insystore{$ID}[1]){print "$insystore{$ID}[1]\
+n";}
            #print "$insstore{$ID}\n";
        }
    }
}
close FILE1 || die "Failed to close $file1 : $!";
[download]

Here's a bit of input

@SQ    SN:TMEM200B    LN:293
@SQ    SN:B3GAT2-2_P001    LN:204
Seq1Perfect    0    B3GAT2-2_P001    1    37    500M400D300M200I100M  
+  *    0    0    GGTTGGTTTTTATTTTTTGGAAGAGTTTTAGATTATAGGTGTTGTCGTCGTT
+AGCGAAGAAGAGTACGTCGGGTTGCGCGCGTTGGTGTTGGTGTTTTTGGCGTAGTTAGGCGAGGTTCGC
+GTTGCGTTGTTTAGTGGCGCGCGGTAGTTCGGGTCGTTTGTAGCGTCGCGGCGTGGGTACGTGTAGGTG
+AGTGTTGGGTAGTT    &a==aa=====a======aaaaaaa====aaa==a=aaa=a==a=$a=$a=
+=aa$aaaaaaaaa=a$a=$aaa==a$a$a$a==aa=a==aa=a=====aa$a=aa==aaa$aaaa==$a
+$a==a$a==a===aa=aa$a$a$aa=aa==$aaa=$a===a=aa$a=$a$aa$a=aaa=a$a=a=aaa=
+aaa=a==aaa=aa==
[download]

The huge XXX M number is inflated for the purposes of bug testing (you can replace it with something more sensible, say 55M6D143M), since there may or may not be one or more D's, I's and M's in the CIGAR ID. Note, there won't be two letters in a row (which is good, because my code fails there). The thing I did was get the various information out of this little string into variables that could be used to later insert X's into the Deletions: the XXX D numbers.

All that said my code is a mess. I dont know how to optimize or how to insert characters into a string, midstring. I can't help much more on my own here, so =(

Comment on Re: string manipulation with Regex Select or Download Code

Replies are listed 'Best First'.
Re^2: string manipulation with Regex by FluffyBunny (Acolyte) on Sep 01, 2010 at 20:01 UTC
Thanks bluecompassrose for your help... (also thank you for posting the bwa file input for me - you wrote 'output' instead) but I am not sure if it is efficient for millions of sequences (basically >2GB of text files).. Any efficient ideas? please let me know.	[reply]
Re^3: string manipulation with Regex by bluecompassrose (Initiate) on Sep 02, 2010 at 16:10 UTC
Haha, yup. I'm terrible at perl. Java programmer at heart =P. That said, take a look at Marshall's code and my comment above, it should work for what you want it to do o.ob	[reply]