Mustering Reads

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Mustering Reads by Corion (Patriarch) on Nov 01, 2010 at 10:14 UTC
Likely, substr is all you need, if you already know what parts overlap and what parts do not overlap. Once you have the non-overlapping parts, you can go through them character by character and if they match, you output that character, and if they don't match, you output the pair. I think there can only be two pairs anyway, so a fancy trick would be to use binary XOR on the "nonmatched" parts: `use strict; my @nonmatched = qw(TACTCAGGAG TACTCGGGAG); my $difference = $nonmatched[0] ^ $nonmatched[ 1 ]; # Replace all zeroes (= matches) by the original string: $difference =~ s/\0/substr($nonmatched[0],pos($difference),1)/ge; # Replace the "weird" parts with the proper pairs my %pairs = ( 'C' ^ 'T' => '(C,T)', 'A' ^ 'G' => '(A,G)', ); $difference =~ s/(.)/$pairs{$1} \|\| $1/ge; print $difference;` [download]	[reply] [d/l]
Re: Mustering Reads by JavaFan (Canon) on Nov 01, 2010 at 12:16 UTC
The following does seem to do the trick. It should work for any number of "repo"s. It does assume all are of the same length, but it wouldn't be complicated to adjust for varying lengths: use 5.010; use strict; use warnings; my $k = 25; my %readrepo =( readA => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", readB => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", readC => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", readD => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", ); my @readstoconcate = ( "readA_1", "readB_2", "readC_1", "readD_2", ); my @prefixes; my @postfixes; my $fixed; foreach my $read (@readstoconcate) { my ($entry, $tag) = split '_', $read; if ($tag == 1) { push @postfixes, substr $readrepo{$entry}, $k; $fixed \|\|= substr $readrepo{$entry}, 0, $k; } else { my $k1 = length($readrepo{$entry}) - $k; push @prefixes, substr $readrepo{$entry}, 0, $k1; $fixed \|\|= substr $readrepo{$entry}, $k1; } } # # Assume all entries in %readrepo are the same length. # foreach my $set (\@prefixes, \@postfixes) { next unless @$set; for (my $i = 0; $i < length $$set[0]; $i++) { my $l = "\x0"; my @c = grep {my $r = ($_ ne $l); $l = $_; $r} map {substr $_, $i, 1} @$set; local $" = ","; print @c == 1 ? $c[0] : "(@c)"; } } continue { state $flag = 0; print $fixed unless $flag++; } print "\n"; __END__ TACTC(A,G)GGAGGCTGAGGCAGGAGAATTGCTTGAAC(C,T)T(G,A)GG(A,G)G(G,A)(C,T)(A +,G) [download]	[reply] [d/l]
Re: Mustering Reads by BrowserUk (Patriarch) on Nov 01, 2010 at 11:12 UTC
This produces a slightly different output for your sample than you provided--but I think mine is right :) (You'll have to explain to me why it isn't, if it isn't.) This could probably be made quite a bit more efficient with further thought, but I wanted to check if it is correct first. #! perl -slw use strict; my $k = 25; my %repo = ( "readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", "readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", "readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", "readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", ); my @order = ( "readA_1", "readB_2", "readC_1", "readD_2"); my( @heads, @tails, $common ); while( @order ) { my( $s1, $p1, $s2, $p2 ) = map split( '_', shift @order ), 1 .. 2; ( $s1, $s2 ) = ( $s2, $s1 ) if $p1 > $p2; push @heads, substr $repo{ $s2 }, 0, length( $repo{ $s2 } ) - $k; push @tails, substr $repo{ $s1 }, $k; $common = substr $repo{ $s1 }, -$k unless $common; } my $head = ''; for my $p ( 0 .. length( $heads[0] )-1 ) { my %uniq; ++$uniq{ substr $heads[ $_ ], $p, 1 } for 0 .. $#heads; if( keys %uniq > 1 ) { $head .= '(' . join( ',', keys %uniq ) . ')'; } else { $head .= each %uniq; } } my $tail = ''; for my $p ( 0 .. length( $tails[0] )-1 ) { my %uniq; ++$uniq{ substr $tails[ $_ ], $p, 1 } for 0 .. $#tails; if( keys %uniq > 1 ) { $tail .= '(' . join( ',', keys %uniq ) . ')'; } else { $tail .= each %uniq; } } print $head, $common, $tail; __END__ c:\test>868716 TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G) [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l]
Re^2: Mustering Reads by neversaint (Deacon) on Nov 02, 2010 at 07:30 UTC
Dear BrowserUK, I was trying the following input to your code above: `my $k = 15; my %repo = ( "readA" => "AAACATGAAAAGAAATGATGAAACAGA", "readB" => "AAATCATGAAACAGAGCCTCATCTCCC ", ); my @order = ( "readB_1","readA_2");` [download] Why it gives this output `AAACATGAAAAGGAGCCTCATCTCCC GCCTCATCTCCC` [download] Instead of: `AAACATGAAAAGAAAT(G,C)ATGAAACAGAGCCTCATCTCCC` [download] Following this alignment: `readA AAACATGAAAAGAAATGATGAAACAGA readB AAATCATGAAACAGAGCCTCATCTCCC` [download] How can I fix it? --- neversaint and everlastingly indebted.......	[reply] [d/l] [select]
Re^3: Mustering Reads by BrowserUk (Patriarch) on Nov 02, 2010 at 08:17 UTC
How can I fix it? By giving better examples :) I had assumed on the basis of the sample supplied, that the overlapping sections were identical as in both cases supplied. I think javafans make that same assumption. You'd need to accumulate the overlaps in a third array in the while, and then add another for loop to process them is the same way as @heads & @tails. (There obviously scope there for making the for loop a subroutine. I'm done for today, but I'll be back tonight.	[reply]
Re^3: Mustering Reads by BrowserUk (Patriarch) on Nov 02, 2010 at 14:34 UTC
As I suggested: #! perl -slw use strict; use Data::Dump qw[ pp ]; my $k = 15; my %repo = ( "readA" => "AAACATGAAAAGAAATGATGAAACAGA", "readB" => "AAATCATGAAACAGAGCCTCATCTCCC ", ); my @order = ( "readB_1","readA_2"); #my $k = 25; #my %repo = ( # "readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA", # "readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC", # "readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG", # "readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC", #); #my @order = ( "readA_1", "readB_2", "readC_1", "readD_2"); sub muster { local $^W; my( $bits ) = @_; my $muster = ''; for my $p ( 0 .. length( $bits->[0] )-1 ) { my %uniq; ++$uniq{ substr $bits->[ $_ ], $p, 1 } for 0 .. $#$bits; if( keys %uniq > 1 ) { $muster .= '(' . join( ',', keys %uniq ) . ')'; } else { $muster .= each %uniq; } } return $muster; } my( @heads, @bodys, @tails ); while( @order ) { my( $s1, $p1, $s2, $p2 ) = map split( '_', shift @order ), 1 .. 2; ( $s1, $s2 ) = ( $s2, $s1 ) if $p1 > $p2; push @heads, substr $repo{ $s2 }, 0, length( $repo{ $s2 } ) - $k; push @bodys, substr $repo{ $s2 }, -$k; push @tails, substr $repo{ $s1 }, $k; push @bodys, substr $repo{ $s1 }, 0, $k; } my $head = muster( \@heads ); my $body = muster( \@bodys ); my $tail = muster( \@tails ); print $head, $body, $tail; __END__ TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G) TACTC(A,G)GGAGGCTGAGGCAGGAGAATTGCTTGAAC(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A +,G) AAACATGAAAAGAAAT(C,G)ATGAAACAGAGCCTCATCTCCC AAACATGAAAAGAAAT(G,C)ATGAAACAGAGCCTCATCTCCC [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l]
Re: Mustering Reads (definition?) by LanX (Saint) on Nov 01, 2010 at 11:11 UTC
Sorry, could you please explain the meaning of "mustering" or provide a link with a definition? I tried to google it, but without much success and I'm not proficient in the arts of bioinformatics. Cheers Rolf	[reply]


Perl Monk, Perl Meditation
	PerlMonks