Here is what I've done. I changed the schema by adding the TF (term frequency) for each term in each document, like this:
Term1 -> DocId1 TF pos... DocId2 TF pos... DocId3 TF pos... ......
I added the TF so that I can tell the doc ids apart from the positions: the TF says how many positions follow each doc id. I suppose there is another way to separate the docs from the positions without adding more information, but I haven't found it yet.
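To make the layout concrete, here is a minimal sketch on toy data. The posting list, the sub names `encode_postings` and `decode_postings`, and the doc ids are all made up for illustration. It only shows that reading the TF first is enough to know where one document's positions end and the next doc id begins.

```perl
use strict;
use warnings;

# Hypothetical toy posting list for one term: doc id => positions.
my %postings = (
    7  => [ 3, 18 ],
    42 => [ 5 ],
    99 => [ 1, 2, 30 ],
);

# Encode as DocId, TF, pos... in consecutive 32-bit slots.
sub encode_postings {
    my ($postings) = @_;
    my $vector = '';
    my $slot   = 0;
    for my $doc (sort { $a <=> $b } keys %$postings) {
        my @pos = @{ $postings->{$doc} };
        vec($vector, $slot++, 32) = $doc;
        vec($vector, $slot++, 32) = scalar @pos;   # TF = number of positions
        for my $p (@pos) {
            vec($vector, $slot++, 32) = $p;
        }
    }
    return $vector;
}

# Decode by reading the TF and then exactly TF positions.
sub decode_postings {
    my ($vector) = @_;
    my $slots = length($vector) / 4;   # 4 bytes per 32-bit slot
    my $slot  = 0;
    my %out;
    while ($slot < $slots) {
        my $doc = vec($vector, $slot++, 32);
        my $tf  = vec($vector, $slot++, 32);
        my @pos;
        push @pos, vec($vector, $slot++, 32) for 1 .. $tf;
        $out{$doc} = \@pos;
    }
    return \%out;
}

my $vector  = encode_postings(\%postings);
my $decoded = decode_postings($vector);
for my $doc (sort { $a <=> $b } keys %$decoded) {
    my @pos = @{ $decoded->{$doc} };
    print "doc $doc: tf=", scalar(@pos), " positions=@pos\n";
}
```

Here the TF is never stored separately from the position count, so the two cannot drift apart.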
Here I store document ids 0 through 500000 in the bit string, keeping the TF and 4 positions for each document:
use strict;
use warnings;
use Devel::Size qw(size);

my $df = 500000;   # highest doc id; the loops run 0 .. $df
my $tf = 4;        # term frequency = number of positions stored per doc
my $wektor = '';
my $nr = 0;        # next free 32-bit slot in the vector

for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;   # DOC ID
    vec($wektor, $nr++, 32) = $tf;    # TF, also the length of the position list
    for my $p (1 .. $tf) {
        vec($wektor, $nr++, 32) = $p + 9;   # POSITIONS (dummy values 10 .. 13)
    }
}
print "Vector's size: " . size($wektor) . " bytes\n";
#print $nr,"\n";

###################### UNPACK VECTOR
my %vec;           # doc id => tf
my $index = 0;     # next 32-bit slot to read
my $Aa = time();
for (0 .. $df) {
    my $docID  = vec($wektor, $index++, 32);
    my $doc_tf = vec($wektor, $index++, 32);
    $vec{$docID} = $doc_tf;
    # print "Doc id: $docID\ttf: $doc_tf\n";
    for (1 .. $doc_tf) {
        # each position is read but not stored anywhere yet
        my $pos = vec($wektor, $index++, 32);
        # print "\t\tpositions: $pos\n";
    }
}
print "unpack vector in \t", time() - $Aa, " secs...(oh Yeah!!!)\n";
Vector's size: 12000052 bytes
unpack vector in 4 secs...(oh Yeah!!!)
As you can see from the code, I save only the doc id and the TF in a hash, without the positions. I am still looking for an appropriate structure to hold all of this information.
That's all for now. I hope you find something faster.
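One possible structure, sketched here as an assumption rather than a recommendation, is to leave the positions inside the vector and keep only a slot offset per doc id in the hash. The names `%tf_slot`, `tf_of` and `positions_of` are made up for this example, and the toy vector is built with `pack 'N*'`, which produces the same big-endian 32-bit slots that `vec(..., 32)` reads.

```perl
use strict;
use warnings;

# Toy vector in the DocId, TF, pos... layout (made-up data).
my $wektor = pack 'N*',
    7,  2, 3, 18,
    42, 1, 5,
    99, 3, 1, 2, 30;

# One pass over the vector: remember, per doc id, the slot holding its TF.
my %tf_slot;
my $slots = length($wektor) / 4;
my $i = 0;
while ($i < $slots) {
    my $doc = vec($wektor, $i++, 32);
    $tf_slot{$doc} = $i;
    my $tf = vec($wektor, $i, 32);
    $i += 1 + $tf;                     # skip the TF slot and the positions
}

# Look up the TF of a document, or undef if the doc is not in the list.
sub tf_of {
    my ($doc) = @_;
    my $slot = $tf_slot{$doc};
    return defined $slot ? vec($wektor, $slot, 32) : undef;
}

# Fetch the positions of a document on demand, straight from the vector.
sub positions_of {
    my ($doc) = @_;
    my $slot = $tf_slot{$doc};
    return () unless defined $slot;
    my $tf       = vec($wektor, $slot, 32);
    my $skip     = 4 * ($slot + 1);    # bytes before the first position
    my $template = "x$skip N$tf";
    return unpack $template, $wektor;
}

for my $doc (sort { $a <=> $b } keys %tf_slot) {
    my @pos = positions_of($doc);
    print "doc $doc: tf=", tf_of($doc), " positions=@pos\n";
}
```

The hash then costs one integer per document regardless of how many positions each document has; whether that trade-off suits the real index would need measuring.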
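For anyone who wants to compare alternatives, the built-in `time()` only counts whole seconds, so a finer clock helps. The sketch below uses the core module Time::HiRes to time two ways of walking the vector: slot by slot with `vec`, and decoding everything at once with `unpack 'N*'`. The sub names and the reduced document count are assumptions for the demo, both variants skip the positions instead of reading them, and no claim is made about which one wins on a given machine.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Smaller than the run in the post, to keep the demo quick (assumed sizes).
my ($docs, $tf) = (50_000, 4);
my $wektor = pack 'N*', map { ($_, $tf, 10 .. 9 + $tf) } 0 .. $docs - 1;

# Variant A: walk the vector with vec(), skipping over the positions.
sub walk_with_vec {
    my ($vector) = @_;
    my %tf_of;
    my $slots = length($vector) / 4;
    my $i = 0;
    while ($i < $slots) {
        my $doc = vec($vector, $i++, 32);
        my $n   = vec($vector, $i++, 32);
        $tf_of{$doc} = $n;
        $i += $n;
    }
    return \%tf_of;
}

# Variant B: decode every slot at once with unpack, then walk the list.
sub walk_with_unpack {
    my ($vector) = @_;
    my @slots = unpack 'N*', $vector;
    my %tf_of;
    my $i = 0;
    while ($i < @slots) {
        my ($doc, $n) = @slots[ $i, $i + 1 ];
        $tf_of{$doc} = $n;
        $i += 2 + $n;
    }
    return \%tf_of;
}

for my $variant ([ 'vec', \&walk_with_vec ], [ 'unpack', \&walk_with_unpack ]) {
    my ($name, $code) = @$variant;
    my $start  = time();
    my $result = $code->($wektor);
    printf "%-6s %d docs in %.3f s\n", $name, scalar(keys %$result), time() - $start;
}
```

Variant B builds a temporary list with one element per slot, so it trades memory for fewer `vec` calls.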