in reply to Is it possible to make reference to substrings and sort them?

problematic when the text is large.

How big is your alphabet? How big is "large text"?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
  • Comment on Re: Is it possible to make reference to substrings and sort them?

Replies are listed 'Best First'.
Re^2: Is it possible to make reference to substrings and sort them?
by spooknick (Novice) on Mar 22, 2015 at 12:10 UTC

    It's for bioinformatics, so the alphabet is 4 letters, and the text can be up to 5M characters in my case.

      the text can be up to 5M characters

      No matter how you generate them, just storing the 5 million rotations of 5 million characters (>23 Terabytes of data) is going to be a problem, long before you get to the point of trying to sort.

      If the purpose of your BWT generation is to aid compression, then there are much more effective and efficient methods of compressing genomic data.

      If the purpose is Short Read Alignment, you almost certainly need to look at the suffix trie methods described there.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        Yes, this is why I wouldn't want to generate all the rotations.

        Short read alignments is what I need to do. I have a few thousand short sequences, length of ~20nt. I want to find those that are present in my text, as well as their position. BWT is the way to go.

        Suffix trie is not efficient in terms of memory, as the trie needs to store all suffixes from 1 to length(text). Thank you for the link to the paper.