Hi Guys,
It seems that the straight unpack is much faster, as the benchmarks below show.
Task: encode and decode a posting list with 250,000 doc ids and 3 words per doc.
Hardware: Pentium 4 CPU 2.8 GHz, 1.5 GB RAM
1. First version, with substr:
pack time: 8s
unpack time: a lot, I couldn't wait for it to finish
2. Second version, with the I template:
pack time: 8s
unpack time: 2s
3. Third version, with the w template:
pack time: 3s
unpack time: 3s
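For anyone following along, here is a minimal sketch of the difference between the I and w templates. It is not my benchmark code; the sample numbers are made up just to show the size difference between fixed-width integers and BER compressed integers.

```perl
use strict;
use warnings;

# Made-up sample values, standing in for a flat list of doc ids and positions.
my @numbers = (3, 17, 250_000, 1, 42);

# 'I' stores each value as a native unsigned int (fixed width, platform dependent).
my $fixed      = pack 'I*', @numbers;
my @from_fixed = unpack 'I*', $fixed;

# 'w' stores each value as a BER compressed integer: 7 bits per byte,
# so small non-negative values need only one byte.
my $ber      = pack 'w*', @numbers;
my @from_ber = unpack 'w*', $ber;

printf "I*: %d bytes, w*: %d bytes\n", length $fixed, length $ber;
```

The w template only accepts non-negative integers, which is fine for doc ids and gaps between them.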
It seems that we have some improvement thanks to you guys, but I am still not satisfied.
I will add to the benchmarks the Elias gamma and the byte-aligned codes that I implemented, but first I have to find a way to distinguish the doc ids from the relative positions.
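One possible way to tell doc ids from positions in a flat compressed stream is to store a count in front of each position run, so the decoder always knows how many positions follow a doc id. The sketch below assumes this layout (doc-id gap, number of positions, position gaps), uses the w template as the byte-aligned code, and the names encode_postings and decode_postings are only for illustration. It is not the schema I use now.

```perl
use strict;
use warnings;

# Assumed layout per document: doc-id gap, number of positions, position gaps.
# $postings is [ [doc_id, [positions...]], ... ], sorted by doc_id,
# with positions sorted inside each document.
sub encode_postings {
    my ($postings) = @_;
    my @out;
    my $prev_doc = 0;
    for my $entry (@$postings) {
        my ($doc, $positions) = @$entry;
        push @out, $doc - $prev_doc, scalar @$positions;
        my $prev_pos = 0;
        for my $pos (@$positions) {
            push @out, $pos - $prev_pos;
            $prev_pos = $pos;
        }
        $prev_doc = $doc;
    }
    return pack 'w*', @out;
}

sub decode_postings {
    my ($bytes) = @_;
    my @in = unpack 'w*', $bytes;
    my @postings;
    my $doc = 0;
    while (@in) {
        $doc += shift @in;           # undo the doc-id gap
        my $count = shift @in;       # how many positions belong to this doc
        my $pos = 0;
        my @positions;
        for my $gap (splice @in, 0, $count) {
            $pos += $gap;            # undo the position gap
            push @positions, $pos;
        }
        push @postings, [ $doc, \@positions ];
    }
    return \@postings;
}

my $bytes   = encode_postings([ [5, [1, 4, 9]], [7, [2]], [20, [3, 10]] ]);
my $decoded = decode_postings($bytes);
print join(' ', map { "$_->[0]:(@{ $_->[1] })" } @$decoded), "\n";
```

The same count-then-values idea should carry over to Elias gamma, since the decoder only needs to know how many codes to read after each doc id.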
Now let's talk about the schema that I use and the "useless" information that I handle in the case of an AND query.
Let's assume that the user gives the query "Perl pack unpack".
First I use the lexicon to translate the terms into the corresponding word ids. The lexicon also keeps the DF (document frequency) for each term, so I can do an efficient intersection by starting from the word with the smallest DF (which currently I don't :( ).
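The DF-ordered intersection could look roughly like the sketch below. The in-memory %index layout (term mapped to df and a sorted list of doc ids) and the sub names are assumptions for the example only; in the real tool the lists would come from the stored index.

```perl
use strict;
use warnings;

# Intersect two sorted lists of doc ids with a linear merge.
sub intersect_sorted {
    my ($x, $y) = @_;
    my ($i, $j) = (0, 0);
    my @common;
    while ($i < @$x && $j < @$y) {
        if    ($x->[$i] < $y->[$j]) { $i++ }
        elsif ($x->[$i] > $y->[$j]) { $j++ }
        else  { push @common, $x->[$i]; $i++; $j++ }
    }
    return \@common;
}

# $index maps a term to { df => N, docs => [sorted doc ids] } (hypothetical layout).
sub and_query {
    my ($index, @terms) = @_;
    return [] unless @terms;
    return [] if grep { !exists $index->{$_} } @terms;

    # Start from the rarest term so the running result stays as small as possible.
    my @by_df  = sort { $index->{$a}{df} <=> $index->{$b}{df} } @terms;
    my $first  = shift @by_df;
    my $result = $index->{$first}{docs};
    for my $term (@by_df) {
        last unless @$result;    # nothing left to match
        $result = intersect_sorted($result, $index->{$term}{docs});
    }
    return $result;
}

my %index = (
    perl   => { df => 4, docs => [1, 2, 5, 9] },
    pack   => { df => 3, docs => [2, 5, 7] },
    unpack => { df => 2, docs => [5, 9] },
);
my $hits = and_query(\%index, qw(perl pack unpack));
print "@$hits\n";    # prints "5"
```

Starting from the smallest list means every later merge is bounded by a result that can only shrink.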
For this example, the word unpack has a DF of 25000 docs and pack a DF of 50000. Now, as the IR book says, you can do efficient intersections via embedded skip pointers. This is why I chose this schema, but the problem is that I haven't used it properly so far. I suppose that to do it, I have to use the filesystem to save my indexes, with very low-level commands like read. I have to read a lot about file handling in Perl to see how I can point to the part of the file that holds the info for the requested term, and how this can be as fast as DBMS indexes.
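As far as I understand it so far, pointing at one part of a file comes down to keeping a byte offset and a length per term in the lexicon and using seek and read. Below is a minimal sketch of that idea, assuming one binary index file, absolute doc ids packed with the w template, and hypothetical helper names (append_postings, read_postings). It uses a temp file only so the example runs on its own.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_END);
use File::Temp qw(tempfile);

# Append one packed posting list to the index file; return (offset, length in bytes).
sub append_postings {
    my ($fh, $doc_ids) = @_;
    my $bytes = pack 'w*', @$doc_ids;
    seek($fh, 0, SEEK_END) or die "seek failed: $!";
    my $offset = tell $fh;
    print {$fh} $bytes or die "write failed: $!";
    return ($offset, length $bytes);
}

# Jump straight to one term's posting list and decode only those bytes.
sub read_postings {
    my ($fh, $offset, $length) = @_;
    seek($fh, $offset, SEEK_SET) or die "seek failed: $!";
    my $got = read($fh, my $bytes, $length);
    die "short read" unless defined $got && $got == $length;
    return [ unpack 'w*', $bytes ];
}

my ($fh, $path) = tempfile(UNLINK => 1);    # stand-in for the real index file
binmode $fh;

# The lexicon keeps [offset, length] per term (assumed layout).
my %lexicon;
$lexicon{pack}   = [ append_postings($fh, [2, 5, 7]) ];
$lexicon{unpack} = [ append_postings($fh, [5, 9, 300]) ];

my $docs = read_postings($fh, @{ $lexicon{unpack} });
print "@$docs\n";    # prints "5 9 300"
```

With this layout only the lexicon has to stay in RAM, and each query term costs one seek plus one read.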
Another reason why I didn't choose the schema you proposed is that the application I am building is an indexing and searching tool where the user can index whatever he wants by crawling the web or his machine, so I can't keep every index that the user builds in RAM. That's why I am trying to find an efficient index schema which will be fast with no system optimizations.
It's a very big discussion, and I would be pleased if you continue to comment on this thread, guys.
P.S. One change that I am thinking of making to my schema is to replace the positions with the TF of each doc and to save the positions in another table. I have to think about it more before I start to implement it.