Re: Do you really want to use an array there?

Replies are listed 'Best First'.
Re^2: Do you really want to use an array there? by BrowserUk (Patriarch) on Apr 14, 2008 at 07:37 UTC
MimisIVI, if you mean vec, Tachyon-II pointed you at vec 5 days ago in Re^6: Compress positive integers. It's right there in the last line. Maybe you missed it? Anyhow, here is Elias Gamma coded in pure Perl: sub log2{ int( ( log( $_[0] ) + 1e-15 ) / log( 2 ) ) } sub PP_EliasPack { my $packed = ''; my $out = 0; for my $num ( @_ ) { my $len = log2( $num ); $out += $len; vec( $packed, $out++, 1 ) = 1; vec( $packed, $out++, 1 ) = ( $num & ( 1 << $_ ) ) ? 1 : 0 for 0 .. $len - 1; } return $packed; } sub PP_EliasUnpack { my $packed = shift; my $bits = length( $packed ) * 8; my $in = 0; my @unpacked; while( $in < $bits ) { my( $len, $num ) = ( 0 ) x 2; $len++ while $in < $bits && vec( $packed, $in++, 1 ) == 0; last if $in == $bits; vec( $packed, $in++, 1 ) and $num \|= ( 1 << $_ ) for 0 .. $len + - 1; $num \|= ( 1 << $len ); push @unpacked, $num; } return @unpacked; } [download] And for reference, here is the results of my benchmark with that included: C:\test>678848 Run with 15000 unique words in 1000 documents (Ave: 554 words/doc) ASCII uses 4755336 bytes W-BER uses 3196819 bytes Binary uses 3279128 bytes Elias uses 3980063 bytes PP_Elias uses 3980063 bytes 1 trial of Packing ascii (10.203s total) 1 trial of Unpacking ascii (3.159s total) 1 trial of Packing W-BER (18.159s total) 1 trial of Unpacking W-BER (1.516s total) 1 trial of Packing binary (9.910s total) 1 trial of Unpacking binary (1.455s total) 1 trial of Packing Elias (13.613s total) 1 trial of Unpacking Elias (2.739s total) 1 trial of Packing PP_Elias (31.128s total) 1 trial of Unpacking PP_Elias (18.094s total) [download] Notice that PP_Elias achieves identical compression to Elias in C (as you'd expect :), but that it's twice as slow at packing, and nearly 10 times slower when unpacking. If you want to continue with using Elias, and can't build the C version yourself, I could let you have it pre-built (for win32), but I have to wonder why you would when W-BER is faster and achieves better compression? Also, I wonder if you saw Re^8: Byte allign compression in Perl.. where I demonstrated that you can have the DB do the selection for you using the schema I suggested way back when, in 0.312 of a second? For the record, I found a small optimisation in the schema that reduce that by a factor of 10, to 31 milliseconds. So the DB does the selection, sends you just the data you need to do your proximity calculations, and does it all faster than you could pack a single integer. Interested? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^3: Do you really want to use an array there? by MimisIVI (Acolyte) on Apr 14, 2008 at 14:13 UTC
No my friend , i didnt missed the Tachyon comment about the vec function..but the deprecated's code help me to understand what is going on with this function.. For the SQL command that you propose i want to ask you for which server is appropriate because on MySQL there is no command for the intersection( i tried some inner join but the perfomance was very very very slow for 1GB dataset (250000 pages,Average document length : 602 words Number of unique words: 907806)... About your code for the Elias technique i have to say that it is 3 times faster than mine (Thanks one more time!!!)...The only reason why i want to use compression in my index is for perfomance reasons..that was my thougths untill now but it seems that i wasnt right..:(..since with the vec function i can decode 3000000 doc ids in 2 seconds and 10 milion in 6 secs!!!) as the below code shows.. `use strict; use Devel::Size qw(size); my $wektor = ''; for(0 .. 10000000) { vec ($wektor, $_, 32) = $_; } print "Vector's size: " . size( $wektor ) . " bytes\n"; my @vec; my $Aa=time(); for(0 .. 10000000) { push @vec,vec ($wektor, $_, 32); } print "unpack vector in \t",time()-$Aa," secs...(oh Yeah!!!)\n";` [download] `Size of vector is 40000032 bytes unpack vector in 6 secs...(oh Yeah!!!)` [download] In the above code i used 4 bytes for each doc id. I tried with a vector where i save the same number of doc ids but with only 1 byte for each doc id ( i saved only small numbers) and the time was completely the same..I cant understand why..DOes anyone??	[reply] [d/l] [select]
Re^4: Do you really want to use an array there? by BrowserUk (Patriarch) on Apr 14, 2008 at 15:40 UTC
..since with the vec function i can decode 3000000 doc ids in 2 seconds and 10 milion in 6 secs!!!) as the below code shows.. First off, using vec to pack 32-bit (Ie. byte, word and dword aligned) numbers is giving you a false impression of it's performance. It's when you start crossing those boundaries that the performance falls off sharply. If you wanted to just pack 32-bits on dword boundaries, pack 'V' (or N if your on a bigendian machine) is faster: `C:\test>p1 cmpthese -3, { pack_32bit => q[ my $packed = pack 'V', 1 .. 1e6 ], vec_32bit => q[ my $packed = ''; vec( $packed, $_, 32 ) = $_ for +1 .. 1e6 ] };; s/iter vec_32bit pack_32bit vec_32bit 5.58 -- -12% pack_32bit 4.94 13% --` [download] But neither gives you the compression you seek. About your code for the Elias technique i have to say that it is 3 times faster than mine* But it is still slower than `my $packed = pack 'w*', @numbers;` and achieves far less compression. pack 'w', (BER) compression is built in, gives the best compression and speed. For the SQL command that you propose i want to ask you for which server is appropriate because on MySQL there is no command for the intersection( i tried some inner join but the perfomance was very very very slow for 1GB dataset (250000 pages,Average document length : 602 words Number of unique words: 907806)... I gave up on MySQL a long time ago because of it's limitations. I has improved markedly in recent versions with allowing subselects places where it never used to, and the addition of stored procedures and stuff but I still prefer Postgres. In particular, the pgAdmin III tool is excellent for tuning your queries. I'll try building a DB to match those numbers and let you know how the performance pans out, but even if it was 50 times slower than with (5000/554/15000), which it won't be, it will still be 100 times faster than having to decompress 25 times as much data as you need, then select the 4% you do in Perl. I'll let you know. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^5: Do you really want to use an array there? by MimisIVI (Acolyte) on Apr 14, 2008 at 16:57 UTC
Re^6: Do you really want to use an array there? by BrowserUk (Patriarch) on Apr 14, 2008 at 17:36 UTC
Some notes below your chosen depth have not been shown here