
Hi Guys,

It seems that the straight unpack is much faster, as the benchmarks below show...

Encoding and decoding a posting list with 250,000 doc ids and 3 words per doc:

Hardware: Pentium 4 CPU 2.8 GHz, 1.5 GB RAM

1. First version, with substr:
pack time: 8 s
unpack time: far too long; I couldn't wait for it to finish

2. Second version, with the I template:
pack time: 8 s
unpack time: 2 s

3. Third version, with the w template:
pack time: 3 s
unpack time: 3 s

It seems that we have some improvement, thanks to you guys... but I am still not satisfied.
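
For reference, here is a rough way to reproduce this kind of comparison with the Benchmark module. The data below is synthetic (250,000 doc ids, each with 3 positions), not my real posting lists, so the exact numbers will differ:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethese);

    # Synthetic posting list: 250_000 doc ids, each followed by 3 positions.
    my @postings;
    for my $doc_id (1 .. 250_000) {
        push @postings, $doc_id, 1, 2, 3;
    }

    my $packed_I = pack 'I*', @postings;    # fixed 32-bit integers
    my $packed_w = pack 'w*', @postings;    # BER, variable length

    printf "I*: %d bytes, w*: %d bytes\n", length $packed_I, length $packed_w;

    timethese( 10, {
        'pack I'   => sub { my $p = pack 'I*', @postings },
        'pack w'   => sub { my $p = pack 'w*', @postings },
        'unpack I' => sub { my @u = unpack 'I*', $packed_I },
        'unpack w' => sub { my @u = unpack 'w*', $packed_w },
    });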



I will add to the benchmarks the Elias gamma and the byte-aligned code that I implemented, but I have to find a way to distinguish the doc ids from the relative positions...

Now let's talk about the schema that I use and the "useless" information that I handle in the case of an AND query.

Let's assume that the user gives the query "Perl pack unpack".

First I use the lexicon to translate the terms into the corresponding word ids. In the lexicon I also keep the DF (document frequency) for each term, so I can do an efficient intersection by starting from the word with the smallest DF (which currently I don't :( ).
For this example, the word unpack has a DF of 25,000 docs and pack a DF of 50,000. As the IR book describes, you can do efficient intersections via embedded skip pointers. That is why I chose this schema, but the problem is that I haven't used it properly until now. To do it as intended, I have to save my indexes on the filesystem and use very low-level commands like read. I have to read a lot about Perl file I/O to see how I can seek to the part of the file where the info for the requested term lives, and whether that can be as fast as DBMS indexes.
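
To make the smallest-DF-first intersection concrete, here is a minimal in-memory sketch (with made-up doc ids, and a simple hash lookup standing in for the skip pointers):

    use strict;
    use warnings;

    # Hypothetical postings: word => sorted list of doc_ids.
    my %postings = (
        unpack => [ 2, 5, 9, 14, 20 ],                  # smallest DF drives the loop
        Perl   => [ 2, 3, 5, 9, 11, 14, 20, 22 ],
        pack   => [ 1, 2, 5, 7, 9, 14, 18, 20, 25 ],
    );

    # Sort the query terms by DF (ascending) and intersect starting from the
    # rarest term; skip pointers or a galloping merge would replace the hash
    # lookups in the real on-disk version.
    my @terms  = sort { @{ $postings{$a} } <=> @{ $postings{$b} } } keys %postings;
    my @result = @{ $postings{ shift @terms } };
    for my $term (@terms) {
        my %seen = map { $_ => 1 } @{ $postings{$term} };
        @result = grep { $seen{$_} } @result;
    }
    print "AND result: @result\n";    # doc ids containing all three terms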

Another reason why I didn't choose the schema you proposed is that the application I am building is an indexing and searching tool where the user can index whatever he wants by crawling the web or his machine, so I can't keep every index that the user builds in RAM. That's why I am trying to find an efficient index schema which will be fast with no system optimizations.

It's a very, very big discussion, and it would be a pleasure for me if you guys continue to comment on this thread.

P.S. One change that I am thinking of making to my schema is to replace the positions with the TF of each doc and to save the positions in another table. I have to think about it more before I start to implement it.
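
Just to sketch what that split might look like (hypothetical structures: doc-level TF in the main postings, positions in a side table keyed by word_id and doc_id):

    use strict;
    use warnings;

    my %postings;    # word_id => packed (doc_id, TF) pairs
    my %positions;   # "word_id:doc_id" => packed positions

    sub add_occurrences {
        my ($word_id, $doc_id, @pos) = @_;
        $postings{$word_id} .= pack 'w2', $doc_id, scalar @pos;    # doc_id + TF
        $positions{"$word_id:$doc_id"} = pack 'w*', @pos;
    }

    add_occurrences( 42, 1001, 3, 17, 59 );
    my ($doc_id, $tf) = unpack 'w2', $postings{42};
    print "doc $doc_id has tf $tf, positions: ",
          join( ' ', unpack 'w*', $positions{"42:1001"} ), "\n";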


Re^4: Byte allign compression in Perl..
by tachyon-II (Chaplain) on Apr 13, 2008 at 04:56 UTC

    I really think you need to drop your attachment to Elias gamma encoding. Even coded in C it is slower than either binary v/a or BER AND uses more space. While the space issue is data dependent, you do get byte-aligned data with either binary v/a or BER. Elias is not byte aligned, so it is more difficult to index, as the index must be a byte plus a 0-7 bit offset. Not only is it likely to be slower and take more space (as shown by our testing), it is also harder to use. Lose. Lose. Lose.

    Actually coding a flag with Elias is trivial. The number of prepended zeros is 1 less than the number of significant digits that follow, so 8 (1000) is encoded as 0001000. All we need to do is encode it as 10001000. When we read the bitstring, if the first digit is a 1 we know it is a doc_id; if it is a zero, it is a pos. The number of prepended zeros is not affected, so we read them in the usual way to know how many significant binary digits to read. We end up at the correct boundary bit and just rinse and repeat.
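
    A minimal Perl sketch of that flagged scheme (assuming one flag bit per value, 1 = doc_id and 0 = pos, followed by a standard gamma code, and using a plain bitstring rather than packed bits to keep it readable):

        use strict;
        use warnings;

        # Flag bit (1 = doc_id, 0 = pos), then N-1 zeros, then the N significant bits.
        sub gamma_encode {
            my ($n, $is_doc_id) = @_;           # $n must be >= 1
            my $bits = sprintf '%b', $n;        # significant binary digits
            return ($is_doc_id ? '1' : '0') . '0' x (length($bits) - 1) . $bits;
        }

        sub gamma_decode_stream {
            my ($bitstring) = @_;
            my @out;
            my $pos = 0;
            while ($pos < length $bitstring) {
                my $flag  = substr $bitstring, $pos++, 1;                # doc_id or pos?
                my $zeros = 0;
                $zeros++ while substr($bitstring, $pos + $zeros, 1) eq '0';
                my $val = oct '0b' . substr $bitstring, $pos + $zeros, $zeros + 1;
                $pos += 2 * $zeros + 1;                                  # zeros + value bits
                push @out, [ $flag ? 'doc_id' : 'pos', $val ];
            }
            return @out;
        }

        my $stream = join '', gamma_encode( 8, 1 ),    # doc_id 8 -> 10001000
                              gamma_encode( 3, 0 ),    # pos    3 -> 0011
                              gamma_encode( 7, 0 );    # pos    7 -> 000111
        print "@$_\n" for gamma_decode_stream($stream);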

    trying to find an efficient index schema which will be fast with no system optimizations.

    Dream on. It sounds as though you want to build a Google and a Google desktop search with a single mid-spec machine and without doing it in the fastest language possible (C/C++).

    It would be helpful if you could outline what it is you are trying to do (search of some sort) and what you hope to achieve.

    What you don't seem to understand about the whole compression/disk transfer thing is this: the process of indexing is incremental, i.e. you can't do it all at once in memory and then dump the results. For each word you have a series of doc_id and pos data, but how do you add to that? Any way you wish to cut it, once it is on disk it exists as a linear stream of bytes in one or more files. To add data to any given file requires you to rewrite the entire content of the file (at least the part past the insertion point).

    The only way to avoid endless file rewrites is to have a chunk of empty space effectively pre-allocated to expected new data. This efficiency strategy is used throughout IT and basically trades space for speed. It is done in memory by programming languages like Perl (which, for example, automatically allocates more memory to a string than you need just in case you want to add data to it, thus avoiding an expensive call to malloc and a data rewrite) and by the OS, which typically leaves a bit of spare space at the end of a file in case you append to it.
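
    A toy illustration of that space-for-speed trade in Perl (hypothetical layout: each word gets a fixed-size slot with slack, so appends normally just overwrite padding instead of rewriting the file; the slot size and file name are made up):

        use strict;
        use warnings;

        my $SLOT = 4096;    # bytes reserved per word; tune to the expected data

        # Record layout inside a slot: 4-byte length prefix, data, zero padding.
        sub write_slot {
            my ($fh, $slot_no, $data) = @_;
            die "posting list outgrew its slot\n" if length($data) + 4 > $SLOT;
            seek $fh, $slot_no * $SLOT, 0 or die "seek: $!";
            my $record = pack( 'N', length $data ) . $data;
            print {$fh} $record, "\0" x ( $SLOT - length $record );
        }

        sub read_slot {
            my ($fh, $slot_no) = @_;
            seek $fh, $slot_no * $SLOT, 0 or die "seek: $!";
            read $fh, my $buf, $SLOT or die "read: $!";
            my $len = unpack 'N', $buf;
            return substr $buf, 4, $len;
        }

        open my $fh, '+>', 'postings.dat' or die "open: $!";
        binmode $fh;
        write_slot( $fh, 0, pack 'w*', 1, 3, 5, 9 );       # initial postings
        write_slot( $fh, 0, pack 'w*', 1, 3, 5, 9, 42 );   # grow within the slot, no rewrite of the rest
        print "slot 0 doc_ids: ", join( ' ', unpack 'w*', read_slot( $fh, 0 ) ), "\n";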

    Before getting carried away, I think you need to develop a test application that can do everything you want indexing-wise in memory on a small representative dataset. Then you can see if your ideas actually work. If it works as you desire, then look at how to scale it so it can be done on disk with incremental data addition.