FluffyBunny has asked for the wisdom of the Perl Monks concerning the following question:
Hello Perl monks,
A noob bioinfomatics student (yes, me) needs your knowledge! I am trying to manipulate string (no space between) with string+integer input(also no space between)using regular expression.
From an alignment program, I get mix of string+integer (called CIGAR ID) and this is after comparing an input string to a reference string. For example, you get "60M2D142M" and this means "60 match, 2 deletion, 142 match" in my input string compared to a reference string (direction: from left to right). Another example is "12M1I192M" and this means "12 match 1 insertion 192 match" in my input string. The length of this string matches to the reference when the integers are added together for deletion (for example, 60+2+142=204 for first example) and total not including number of insertion for insertion case (12+192=204 for 2nd example).
I know parsing this by number and letter (M,D,I) is easy.
split /[A-Z]/So my first example after parsing will be an array of indexes "60 M 2 D 142 M" However, I need to use all these info so I can make sure to skip 60 letters and add two Xs to make up the deletion of two letters and skip 142 (this will make the same length as the reference string. For my second example, I will get "12 M 1 I 192 M" and I need to get rid of this insertion after 12 letters and skip the rest of letters to make the same length as the reference.
My code is already getting complicated as I have to worry about this CIGAR ID that comes with multiple "I"s and "D"s..
this is a part of my code:
my %bwa = (); my @splitstore=(); my @s1 = (); my @s2 = (); my @s3 = (); my @s4 = (); my @s5 = (); my $ssl=(0); my $splitstoreD=(0); my $splitstoreI=(0); my $file1 = shift; open (FILE1, "$file1") || die "Failed to open $file1 for reading : $!" +; # Open second file while (<FILE1>) { # Reading second hash if ($_ =~ /^[^@]/s) { chomp; my @line = split /\s+/, $_; my $ID; if ($line[2] =~ /[^*]/) { $ID = $line[0]; push @{$bwa{$ID}}, @line[0,2,5,12,15,16,17,18,3]; #$bwa{$ID}[0] = $line[0]; # seq ID #$bwa{$ID}[1] = $line[2]; # Ref ID #$bwa{$ID}[2] = $line[5]; # CIGAR ID for insertion #$bwa{$ID}[3] = @line[9]; # Processed seq (already C->T) #$bwa{$ID}[3] = $line[12]; # Edit distance (edited area by + # of base) : NM #$bwa{$ID}[4] = $line[15]; # No. of mismatches in the alig +nment : XM #$bwa{$ID}[5] = $line[16]; # No. of gap opens for insertio +n : XO #$bwa{$ID}[6] = $line[17]; # No. of gap extensions for del +etion :XG #$bwa{$ID}[7] = $line[18]; # Mismatching positions / bases + : MD #$bwa{$ID}[8] = $line[3]; # Starting position # These are not from my input, I made these so I can save +some data for parsing the CIGAR ID. #$bwa{$ID}[10] = Total Number of Bases #$bwa{$ID}[11] = Deletion (0 if no) #$bwa{$ID}[12] = Insertion (0 if no) #$bwa{$ID}[13] = Number of bases until deletion #$bwa{$ID}[14] = Number of bases until insertion # Modifying CIGAR ID's to keep track of matches, deletions +, insertions @splitstore = split /[A-Z]/, $bwa{$ID}[2]; if ($bwa{$ID}[2] =~ /[D]/) { $splitstoreD = 1; } else { $s +plitstoreD = 0 } if ($bwa{$ID}[2] =~ /[I]/) { $splitstoreI = 1; } else { $s +plitstoreI = 0 } #Storing Total Bases $ssl = scalar @splitstore; if($ssl>1){ if ($ssl>4){ #If there is both a deletion and insertio +n... $bwa{$ID}[10] = $splitstore[0]+$splitstore[2]+$spl +itstore[$ssl-1]; #add first, third and last nubmers to get total base +s } else{ $bwa{$ID}[10] = $splitstore[0]+$splitstore[$ssl-1];} # +add first and last numbers to get total bases } else { $bwa{$ID}[10] = $splitstore[0]; } #just the first n +umber #Defining Deletions if($splitstoreD == 1) { $bwa{$ID}[11]="Del" } else { $bwa{ +$ID}[11]=0 } #Defining Insertions if($splitstoreI == 1) { $bwa{$ID}[12]="Ins" } else { $bwa{ +$ID}[12]=0 } #Defining Position of Deletion if ($bwa{$ID}[2] =~ /(\d*(?=D))/) { @s1 = split /(\d)/, $bwa{$ID}[2]; @s2 = split /(\d+M)/, @s1; @s3 = split /(\d+M)/, @s2; @s4 = split /(\d+I)/, @s3; @s5 = split /[D]/, @s4; $bwa{$ID}[13]=@s5; } print "@splitstore\t"; print "$splitstoreI\t"; print "$splitstoreD\t"; print "$bwa{$ID}[10]\n"; print "@s1\t"; print "@s2\t"; print "@s3\t"; print "@s4\t"; print "@s5\t"; print "$bwa{$ID}[13]\n"; @s1 = (); @s2 = (); @s3 = (); @s4 = (); @s5 = (); } } } close FILE1 || die "Failed to close $file1 : $!";
This is not completed yet. I am overwhelmed by multiple Is and Ds case...Is there a better way to do? Please help.
|
|---|