PerlMonks  

Re^8: Byte allign compression in Perl..

by BrowserUk (Patriarch)
on Apr 13, 2008 at 10:00 UTC ( [id://680065]=note )


in reply to Re^7: Byte allign compression in Perl..
in thread Byte allign compression in Perl..

As far as I can see the intersection problem is O(n) * O(log n), which makes scaling it problematic.

How does 312ms sound for a two-word query in a DB containing 15,000 words in 5000 documents (ave: 554 words/doc; 2.7 million word-document pairs)?

678848=# select
678848-#     "wordFromId"( "word-id" ),
678848-#     "docnameFromId"( "doc-id" ),
678848-#     posns
678848-# from "word-doc"
678848-# where
678848-#     "word-id" in (
678848(#         "idFromWord"( 'aachen' ),
678848(#         "idFromWord"( 'zwitterions' )
678848(#     )
678848-# and
678848-#     "doc-id" in (
678848(#         select "docIdsContaining"( "idFromWord"( 'aachen' ) )
678848(#         intersect
678848(#         select "docIdsContaining"( "idFromWord"( 'zwitterions' ) )
678848(#     )
678848-# order by
678848-#     "word-id", "doc-id"
678848-# ;
 wordFromId  | docnameFromId | posns
-------------+---------------+-------
 aachen      | document_223  | 1723
 aachen      | document_940  | 1903
 aachen      | document_1778 | 273
 aachen      | document_1897 | 3163
 aachen      | document_3990 | 817
 aachen      | document_4407 | 4736
 zwitterions | document_223  | 3072
 zwitterions | document_940  | 3504
 zwitterions | document_1778 | 664
 zwitterions | document_1897 | 4186
 zwitterions | document_3990 | 355
 zwitterions | document_4407 | 1459
(12 rows)

678848=# select count( * ) from "word-doc";;
  count
---------
 2719282
(1 row)
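
For readers wondering where the cost of the intersect step comes from, here is a rough sketch in Python rather than SQL. It is not what "docIdsContaining" or the database actually does (those internals are not shown in this thread); it only illustrates two common ways to intersect sorted doc-id lists: a linear merge, and a binary search of the longer list for each id in the shorter one, which is where an n * log n style bound comes from. The function names and the sample ids are made up for illustration.

```python
from bisect import bisect_left


def intersect_merge(a, b):
    """Intersect two ascending, duplicate-free id lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_bisect(short, long):
    """Binary-search `long` for each id in `short`: O(len(short) * log(len(long)))."""
    out = []
    for doc_id in short:
        k = bisect_left(long, doc_id)
        if k < len(long) and long[k] == doc_id:
            out.append(doc_id)
    return out


# Made-up posting lists (doc ids), purely for illustration.
docs_with_a = [3, 17, 42, 99, 256, 1024]
docs_with_b = [5, 17, 99, 300, 1024, 2048, 4096]

print(intersect_merge(docs_with_a, docs_with_b))
print(intersect_bisect(docs_with_a, docs_with_b))
```

The binary-search variant tends to pay off when one list is much shorter than the other (a rare word against a common one); the merge is the safer default when the lists are of similar length.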

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^9: Byte allign compression in Perl..
by tachyon-II (Chaplain) on Apr 13, 2008 at 10:25 UTC

    Sounds good enough for government work (in fact you beat Google) and certainly a practical working solution to the problem as stated. It does sound as though the OP has a bigger data set in mind though.

    One point he does not seem to get is how caching reduces the need to do this: each time someone searches for "Some term", you cache the first n results so you don't have to repeat the expensive task. You can see Google do this if you search for 'aachen zwitterions'. The first search took 0.35 seconds to retrieve 514 results, but the second search took only 0.07 seconds. I tried this on google.com.au and google.com, so you will need to try another obscure pair or a different server farm to see it. What amazes me is the raw speed, but then again there are probably several thousand machines dealing with the query.
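
    The caching idea can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not how Google or anyone in this thread does it: the class name QueryCache, the slow_search stand-in, and the choice to treat a query as an unordered set of lower-cased terms are all hypothetical.

```python
from collections import OrderedDict


class QueryCache:
    """Tiny LRU cache mapping a normalised query to its first `top_n` results."""

    def __init__(self, search_fn, top_n=10, max_entries=1000):
        self.search_fn = search_fn      # the expensive lookup being wrapped
        self.top_n = top_n
        self.max_entries = max_entries
        self._cache = OrderedDict()
        self.misses = 0                 # how often the expensive lookup ran

    def search(self, query):
        # Assumption: term order and case do not matter for an AND query.
        key = tuple(sorted(set(query.lower().split())))
        if key in self._cache:
            self._cache.move_to_end(key)        # mark as recently used
            return self._cache[key]
        self.misses += 1
        results = list(self.search_fn(key))[: self.top_n]
        self._cache[key] = results
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)     # evict least recently used
        return results


def slow_search(terms):
    # Stand-in for the expensive intersection query; the names are fake.
    return [f"document_{i}" for i in range(25)]


cache = QueryCache(slow_search, top_n=10)
first = cache.search("aachen zwitterions")
second = cache.search("Zwitterions  aachen")    # same terms, served from cache
print(len(first), first == second, cache.misses)
```

    Only the first n results are stored, so a request for a later page would still have to hit the index.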

    Intriguing dataset you have there! I won't ask.

      It does sound as though the OP has a bigger data set in mind though.

      I'm setting up a bigger DB (and one that doesn't require all the quoted field/table/function names) as we speak, to test the scalability, but on my preliminary tests it should scale pretty nearly linearly. (Famous last words :)

      Intriguing dataset you have there!

      I selected 15,000 words at random from my dictionary, then picked 554 words at random from those, 5000 times over, to become the documents. They wouldn't make for very interesting reading :)
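
      For anyone wanting to build a similar test set, the recipe can be sketched roughly like this in Python. The function names, the document numbering from 1, and the fake dictionary are hypothetical. Whether the words for each document were drawn with or without replacement is not stated; this sketch assumes with replacement, which would at least be consistent with the table holding 2,719,282 rows rather than the full 5000 * 554 = 2,770,000 if repeated words in a document share a single row.

```python
import random


def make_corpus(dictionary, vocab_size=15_000, n_docs=5_000,
                words_per_doc=554, seed=0):
    """Pick a random vocabulary, then build documents of random words from it."""
    rng = random.Random(seed)
    vocab = rng.sample(dictionary, vocab_size)      # distinct words
    docs = {
        # Assumption: words are drawn WITH replacement, so repeats can occur.
        f"document_{i}": rng.choices(vocab, k=words_per_doc)
        for i in range(1, n_docs + 1)
    }
    return vocab, docs


def count_word_doc_pairs(docs):
    """Number of distinct (word, document) pairs, one per index row."""
    return sum(len(set(words)) for words in docs.values())


# Small demo with a fake dictionary so it runs quickly.
dictionary = [f"word{i}" for i in range(1000)]
vocab, docs = make_corpus(dictionary, vocab_size=100, n_docs=20,
                          words_per_doc=30, seed=1)
print(len(vocab), len(docs), count_word_doc_pairs(docs))
```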

      The schema I'm using is the one I posted earlier, and some of the complexity of the query is hidden behind user-defined functions in the DB.


