in reply to Re: Re: Quickest method for matching
in thread Quickest method for matching

Hey Guys,

To be more specific about the "very large file" containing the DNA sequences: it holds 27,000 sequences and looks something like this:
>identifier 1
ATG...(50ish to 4000 base pairs long)
>identifier 2
ATG...etc.
>identifier 3
Etc....
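
For readers following along, here is a minimal sketch of reading records in the layout shown above into memory. It assumes that each record starts with a ">" line, that a sequence may wrap over several lines, and that identifiers are unique; the function name read_sequences and the sample data are hypothetical, not from the original files.

```python
from typing import Dict, Iterable, List, Optional


def read_sequences(lines: Iterable[str]) -> Dict[str, str]:
    """Parse '>identifier' / sequence records into {identifier: sequence}."""
    sequences: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            current_id = line[1:].strip()
            chunks = []
        elif current_id is not None:
            # Sequence data; join wrapped lines and normalise case.
            chunks.append(line.upper())
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


# Made-up sample data in the same layout as the post.
sample = [">identifier 1", "ATGCCGTA", ">identifier 2", "ATGTT", "GACA"]
records = read_sequences(sample)
```

For a real file, passing an open file handle (which iterates line by line) instead of the sample list should work the same way.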

Actually, the input file is the larger of the two: it contains much smaller chunks of DNA in the same format, probably about 200,000 sequences per file.
I have several of these files that I will search against the other "large file".
Sorry for the lack of detail. I never know how much to give without boring people with useless details.

Cheers,

Dr.J

Re: Re: Re: Re: Quickest method for matching
by sauoq (Abbot) on Aug 07, 2002 at 18:29 UTC
    Of course this makes a great deal of difference.

    If you have more data to search for than data to search through, don't use the method I suggested.

    Given the narrow scope of your problem, there are probably a lot of optimizations you could make.

    • How often will you need to search for the same 200,000 sequences in different input?
    • How often will you need to perform a search on the same 27,000 larger sequences?
    • Do you know the probability characteristics of your search? Can you expect many hits or very few?
    • What about overlaps?
    • Are any of your shorter search sequences present in any of your larger ones?
    • Is character data the best representation? (As you only use four characters, you might want to look into using bit strings or something instead.)
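
To illustrate the last point, here is one possible sketch of a bit-string representation: two bits per base, packed into a single integer. It assumes the data contains only A, C, G and T (an ambiguity code such as N would raise a KeyError), and the names pack and unpack are hypothetical. Whether this actually speeds up a search depends on the matching method used.

```python
# Two bits are enough to distinguish four bases.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 1  # leading sentinel bit so that leading 'A's (00) are not lost
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def unpack(value: int) -> str:
    """Reverse pack(): read two bits at a time until only the sentinel is left."""
    bases = []
    while value > 1:
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


packed = pack("ATGAAC")
restored = unpack(packed)
```

The sentinel bit costs one extra bit per sequence but keeps "AATG" and "ATG" distinct, which a plain two-bit encoding would not.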

    If you don't expect to be doing this kind of search often, you might be better off just brute forcing it than trying to optimize it too much.
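
A brute-force pass could look roughly like the sketch below. It assumes exact substring matching (the thread does not say whether inexact matches matter) and that both sets of sequences fit in memory as dictionaries keyed by identifier; the function name and the sample data are made up for illustration.

```python
from typing import Dict, List, Tuple


def brute_force_matches(
    queries: Dict[str, str], targets: Dict[str, str]
) -> List[Tuple[str, str, int]]:
    """Return (query_id, target_id, offset) for every exact occurrence."""
    hits: List[Tuple[str, str, int]] = []
    for q_id, q_seq in queries.items():
        if not q_seq:
            continue  # an empty query would match everywhere
        for t_id, t_seq in targets.items():
            pos = t_seq.find(q_seq)
            while pos != -1:
                hits.append((q_id, t_id, pos))
                # Step by one so overlapping occurrences are also reported.
                pos = t_seq.find(q_seq, pos + 1)
    return hits


# Made-up example data.
targets = {"t1": "ATGATGCC", "t2": "CCGTA"}
queries = {"q1": "ATG", "q2": "CC", "q3": "TTT"}
hits = brute_force_matches(queries, targets)
```

With 200,000 short sequences against 27,000 long ones this means billions of substring tests, so it is only sensible if the search is run rarely.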

    Good Luck!

    -sauoq
    "My two cents aren't worth a dime.";