in reply to Exact string matching
well though the fellow monks have posted some interesting solutions, and that is why this forum is much better than any other, i think that the real answer, the one that meets your requirements, is : "You cannot!"
What do i mean by this, well i have been into string bizz for a while now, and dealing with the exact problem that you have stated and those bit more complicated. First, to meet the requirements in terms of ram you will not achieve this in perl. try c, and even then you will have some high constants associated with your data structure. Second, the best linear time, proportional to the size of the pattern, searches, can be achieved by using suffix trees (Ukkonnen).
There is a perl module (module) that is extremely good for this and i would suggest buying some extra ram and using it instead writing the tree by yourself in c. It has an O(P) search time capabilities and as far as the "additional constants" associated with perl goes,well you can in fact neglect those since the tree will use much more space(RAM).
This was a comment to your statement:"...this is not efficient as I have to deal with file sized over GB's..."
Now as GrandFather always keeps reminding me:
"Note that "efficient" probably isn't the same as "fast"..."
and almost always is right, maybe this isn't what you need. if i were you i would look into bit-vectors -> through them you can compress your DNA by 75%, meaning 75% less memory required and 75% shorter the search space, that on a 2GB data will outperform(in terms of effective efficiency) the suffix tree and O(P) time complexity, though it doesn't run in linear time. However if you are thinking of using it on larger data sets, well welcome to the club and good luck !
Cheers
Baxy
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Exact string matching
by saranrsm (Acolyte) on Oct 17, 2011 at 18:23 UTC |