PerlMonks  

Re: Search for identical substrings

by BrowserUk (Patriarch)
on Aug 18, 2005 at 07:54 UTC


in reply to Search for identical substrings

... the length of my strings (3k characters), and the number of elements (300) leads to prohibitive times for my search. It took a week just to check one element of the array against every other element.

Can you confirm this, please? Your current method took 1 week to do 299 LCSs. As you have (300 * 299)/2 = 44,850 to do, would this take 150 weeks to perform the processing?
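For anyone who wants to check that arithmetic, here is a minimal sketch in Python (chosen only for brevity). The variable names are mine, and it assumes the rate of 299 comparisons per week holds for the whole run:

```python
from math import comb

n_strings = 300               # number of elements in the array
pairs = comb(n_strings, 2)    # unordered pairs: (300 * 299) / 2
per_week = n_strings - 1      # assumption: 299 comparisons took one week
weeks = pairs / per_week      # total weeks at that rate

print(pairs, weeks, round(weeks / 52, 1))  # pairs, weeks, approximate years
```

At that rate the full all-against-all comparison works out to a little under three years.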

If so, I think I can help you. I believe I can get that down to 67 hours. But, as this is so much quicker (than both your current method and a couple of others I have tried), I would very much like to verify my program against some known data.

So, if you could let us/me have say 5 of your 3k strings, and the LCS that your current method finds + the time taken, I could check what I have against your findings before exposing any stupidities to the world.
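One way to cross-check results on a handful of strings is a plain dynamic-programming baseline. The sketch below is in Python and is not the faster program mentioned above; it assumes that "LCS" in this thread means the longest common substring (contiguous), and the function name is hypothetical:

```python
def longest_common_substring(a, b):
    """Return one longest contiguous substring shared by a and b.

    Classic O(len(a) * len(b)) dynamic programme, kept to two rows of
    the table so memory stays proportional to len(b).
    """
    best_len = 0
    best_end = 0                      # exclusive end index of the best match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("ATGCTGTAGC", "GGCTGTAA"))
```

For two 3k strings this visits about nine million cells, so it is only useful as a slow reference for verifying a few pairs.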

TIA.

Alternatively, I could provide my test data for you to try.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Re^2: Search for identical substrings
by bioMan (Beadle) on Aug 18, 2005 at 16:19 UTC

    Your time estimates agree with mine. I calculated a time of completion of 3 years.

    Thank you for your offer. I would like to look at all my options first, including abandoning the project, optimizing the data by removing redundant sequences (no easy task given the lack of documentation for some of my data), or subclassing the data into smaller sets of sequences.

    I would also like to look at the other responses I've received, but I will not forget your offer.

      Can you generate a data set that is representative of the problem and put it in your scratchpad?


      Perl is Huffman encoded by design.

        I have placed six actual strings from my database into my public scratchpad. Each string is formatted as follows:

        >string 1
        ATGCTGTAGCATGCATG...CGATCATGTGACTACGT
        >string 2
        . . .

        The first line starts with ">" followed by a string ID. The second line is the actual data string.
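        A minimal sketch of reading that layout, in Python. The function name is hypothetical, and it assumes each ID line is followed by one or more data lines that belong to it:

```python
def parse_records(text):
    """Map each string ID to its data string.

    Assumes a line starting with '>' carries the ID, and every following
    non-blank line up to the next '>' line is data for that ID.
    """
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            current_id = line[1:].strip()
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line   # join data split over several lines
    return records


sample = ">string 1\nATGCTGTAGC\n>string 2\nGGCTGTAA\n"
print(parse_records(sample))
```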
