It's for bioinformatics, so the alphabet is 4 letters, and the text can be up to 5M characters in my case.
| [reply] |
the text can be up to 5M characters
No matter how you generate them, just storing the 5 million rotations of 5 million characters (>23 Terabytes of data) is going to be a problem, long before you get to the point of trying to sort.
If the purpose of your BWT generation is to aid compression, then there are much more effective and efficient methods of compressing genomic data.
If the purpose is Short Read Alignment, you almost certainly need to look at the suffix trie methods described there.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
| [reply] |
Yes, this is why I wouldn't want to generate all the rotations.
Short read alignments is what I need to do. I have a few thousand short sequences, length of ~20nt. I want to find those that are present in my text, as well as their position. BWT is the way to go.
Suffix trie is not efficient in terms of memory, as the trie needs to store all suffixes from 1 to length(text). Thank you for the link to the paper.
| [reply] |