in reply to Exact string matching

Hi,

well though the fellow monks have posted some interesting solutions, and that is why this forum is much better than any other, i think that the real answer, the one that meets your requirements, is : "You cannot!"
What do i mean by this, well i have been into string bizz for a while now, and dealing with the exact problem that you have stated and those bit more complicated. First, to meet the requirements in terms of ram you will not achieve this in perl. try c, and even then you will have some high constants associated with your data structure. Second, the best linear time, proportional to the size of the pattern, searches, can be achieved by using suffix trees (Ukkonnen).
There is a perl module (module) that is extremely good for this and i would suggest buying some extra ram and using it instead writing the tree by yourself in c. It has an O(P) search time capabilities and as far as the "additional constants" associated with perl goes,well you can in fact neglect those since the tree will use much more space(RAM).

This was a comment to your statement:"...this is not efficient as I have to deal with file sized over GB's..."

Now as GrandFather always keeps reminding me:

"Note that "efficient" probably isn't the same as "fast"..."

and almost always is right, maybe this isn't what you need. if i were you i would look into bit-vectors -> through them you can compress your DNA by 75%, meaning 75% less memory required and 75% shorter the search space, that on a 2GB data will outperform(in terms of effective efficiency) the suffix tree and O(P) time complexity, though it doesn't run in linear time. However if you are thinking of using it on larger data sets, well welcome to the club and good luck !

Cheers

Baxy

Replies are listed 'Best First'.
Re^2: Exact string matching
by saranrsm (Acolyte) on Oct 17, 2011 at 18:23 UTC
    Dear Monk,
    

    Thank you very much for your wonderful and meaningful post. ya i bet, this is a wonderful platform where you get such caring guidance, there's no other place for perl other than this.

    I have already tried the recommended module that you have mentioned, but the problem with that is, it makes use of an external library written in C and more over that library is near an abandonware (stated by the author, i.e. he will accept the patches but has no idea to do by himself) and I cant modify it as to my needs because i am not good at C (my background is life science). I even tried to implement Suffix array and to my surprise there are no known perl module(if i am not mistaken) but havn't gave up i am trying it still.

    and ya what was that "bit vectors" all about, it seems interesting (Sry I havn't heard of it) could you shed some light over it or direct me to some digestible article so that i could grasp it (as i am not from computer science background)