Re^4: Is it possible to make reference to substrings and sort them?

Replies are listed 'Best First'.
Re^5: Is it possible to make reference to substrings and sort them? by BrowserUk (Patriarch) on Mar 23, 2015 at 05:40 UTC
Suffix trie is not efficient in terms of memory, as the trie needs to store all suffixes from 1 to length(text) Hm. Maybe I'm misreading the paper, but my reading suggests that constructing a BWT the classical method -- construct all the rotations and then sort them -- takes quadratic time, which is why the normal practice is now to construct a prefix trie first; and the BWT from that (to save time). And further, the problem of suffux tries being inefficient of space ("nlog2n bits of working space which amounts to 12 GB for human genome") is fixed by Hon et al. (2007) which requires "only requires <1 GB memory at peak time for constructing the BWT of human genome". The algorithm shown in Figure 2 is quadratic in time and space. However, this is not necessary. In practice, we usually construct the suffix array first and then generate BWT. Most algorithms for constructing suffix array require at least nlog2n bits of working space, which amounts to 12 GB for human genome. Recently, Hon et al. (2007) gave a new algorithm that uses n bits of working space and only requires <1 GB memory at peak time for constructing the BWT of human genome . This algorithm is implemented in BWT-SW (Lam et al., 2008). We adapted its source code to make it work with BWA. Anyway. How many mismatches do you allow and still consider a site to match one of your 20nt short reads? I have code that can find all the 1-mismatch sites for each of 10,000 x 20nt short reads. in a 5Mnt strand, in 1.5 seconds using 1/4GB memory: `C:\test>1120869 -BIG=5e6 -FUZZ=1 -MIN=10 -NSHORTS=10000 \| wc -l Indexed search took 1.509765148 secs 2` [download] Allowing two mismatches per requires 6 1/2 minutes : C:\test>1120869 -BIG=5e6 -FUZZ=2 -MIN=6 -NSHORTS=10000 \| wc -l Indexed search took 395.132736921 secs 80 C:\test>1120869 -BIG=5e6 -FUZZ=2 -MIN=6 -NSHORTS=10000 AAACTTGTTACCGGGGTACT ~ TAACTTGTTACCGGGTTACT ( AACTTGTTACCGGG TACT) [18 + of 20] at 1423486 CTCGCCATGTAAGTTATGGT ~ CCCGCCATGAAAGTTATGGT (C CGCCATG AAGTTATGGT) [18 + of 20] at 3739837 ACTCAGGAGCATATGAGCAA ~ ACTCGGGAGCATATGAGCAG (ACTC GGAGCATATGAGCA ) [18 + of 20] at 117017 GCGCACTACAACCTCTTAAA ~ GCGCACTACAACGTCTTATA (GCGCACTACAAC TCTTA A) [18 + of 20] at 2758181 CCAGTATATGATACGAACGC ~ CCAGTATCTGGTACGAACGC (CCAGTAT TG TACGAACGC) [18 + of 20] at 2681307 CTCGCACGATTACGACCGAA ~ CTCGCAGAATTACGACCGAA (CTCGCA ATTACGACCGAA) [18 + of 20] at 4658451 ... [download] And allowing 3 requires 30minutes: `C:\test>1120869 -BIG=5e6 -FUZZ=3 -MIN=5 -NSHORTS=10000 \| wc -l Indexed search took 1763.693833113 secs 1455` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]
Re^6: Is it possible to make reference to substrings and sort them? by spooknick (Novice) on Mar 23, 2015 at 08:21 UTC
No, I don't think you didn't misread the paper, but the suffix trie is not the same as the suffix tree or suffix array. I thought it had to be corrected, in case someone reads the post and misunderstands it, sorry about that. I have to find all ready that match with no mismatch. I'd be interested in your code.	[reply]
Re^7: Is it possible to make reference to substrings and sort them? by BrowserUk (Patriarch) on Mar 23, 2015 at 11:00 UTC
I have to find all ready that match with no mismatch If you're only looking for exact matches, something like this does the job in ~ 7 seconds (including building the index): #! perl -slw use strict; use Time::HiRes qw[ time ]; our $MIN //= 5; our $SHORT //= 20; our $NSHORTS //= 10e3; our $BIG //= 5e6; our $S //= 0; $S and srand( $S ); sub rndDna{ my $dna =''; $dna .= ( qw[A C G T] )[ rand 4 ] for 1 .. $_ +[0]; return $dna; } my $bigDna = rndDna( $BIG ); my @shorts = map rndDna( $SHORT ), 1 .. $NSHORTS; my $start = time; my %idx; push @{ $idx{ substr $bigDna, $_, $MIN } }, $_ for 0 .. lengt +h( $bigDna ) - $MIN; printf STDERR "Indexing took %.9f secs\n", time() -$start ; $start = time; my $found = 0; for my $short ( @shorts ) { for my $pos ( @{ $idx{ my $pre = substr $short, 0, $MIN } } ) { if( substr( $bigDna, $pos, $SHORT ) eq $short ) { print "$short found at $pos"; ++$found; } } } print STDERR "Found: ", $found; printf STDERR "Lookup of $NSHORTS $SHORT-nt short reads against a $BIG +-nt sequence took %.9f secs\n", time() -$start ; __END__ C:\test>1120869-2 -BIG=5e6 -MIN=5 -SHORT=14 -NSHORTS=10000 -S=1 \| wc - +l Indexing took 3.365217924 secs Found: 182 Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 25.56 +8705082 secs 182 C:\test>1120869-2 -BIG=5e6 -MIN=6 -SHORT=14 -NSHORTS=10000 -S=1 \| wc - +l Indexing took 3.351562023 secs Found: 182 Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 6.666 +512966 secs 182 C:\test>1120869-2 -BIG=5e6 -MIN=7 -SHORT=14 -NSHORTS=10000 -S=1 \| wc - +l Indexing took 4.609364986 secs Found: 182 Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 1.699 +807882 secs 182 C:\test>1120869-2 -BIG=5e6 -MIN=7 -SHORT=14 -NSHORTS=10000 -S=1 \| wc - +l Indexing took 4.519516945 secs Found: 182 Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 1.682 +117939 secs 182 C:\test>1120869-2 -BIG=5e6 -MIN=8 -SHORT=14 -NSHORTS=10000 -S=1 \| wc - +l Indexing took 6.175763845 secs Found: 182 Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 0.424 +978018 secs 182 C:\test>1120869-2 -BIG=5e6 -MIN=10 -SHORT=14 -NSHORTS=10000 -S=1 \| wc +-l Indexing took 9.214743853 secs Found: 182 Lookup of 10000 14-nt short reads against a 5e6-nt sequence took 0.058 +255196 secs 182 [download] Note: I'm using 14nt short reads because any longer and none are found in my randomly generated test sequence. You can reduce the lookup time by increasing number of characters you index; with the trade off that the indexing take longer and uses more memory: `1 x 5e6nt sequence; 10e3 x 14nt short reads; Index prefix: indexing(secs): lookup(secs): found: memory(MB): 5 3.37 25.57 182 200 6 3.38 6.66 182 210 7 4.06 1.69 182 230 8 6.17 0.42 182 244 10 9.21 0.05 182 450` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]