Hello Monks,
I need to write my own implementation of the vector space model for searching through a given set of documents. I know ready-made scripts are available, but I need to write one myself. The code I wrote works, but it takes almost 40 minutes to run for a total token count of 5600 across 1500 documents. 99% of that time is spent building the hash of document and term frequencies for each token. I need help improving the performance. Here is what I have:
1) A list of documents, with stopwords removed and tokens stemmed.
2) I loop through all the documents to build a master hash of all the tokens. In the same loop I build a hash of hashes that holds, for each document, its tokens and their frequencies in that document.
e.g.
%hash1 = (
    1 => { air => 5, water => 6 },
    2 => { orange => 2, air => 4, soup => 10 },
);
$hash2{$filename} = \%hash1;
The above code runs for each file in the directory, so %hash2 ends up keyed by filename, with each value a reference to the %hash1 that holds that file's tokens and their frequencies.
3) For each token in the master token hash (about 5600 tokens), I search all the documents:
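for my $token (sort keys %mstrToken)
{
    my $df = 1;    # running count of documents that contain $token
    foreach my $doc (@docNames)
    {
        # copy this document's token => frequency hash
        %tempHash = %{ $hash2{$doc} };
        if (exists $tempHash{$token})
        {
            $tkfreq = $tempHash{$token};
            $mh{$token}->{'docf'} = $df++;    # document frequency so far
            $mh{$token}->{$doc} = $tkfreq;    # term frequency in this document
        }
    }
    # end of file processing for loop
}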
This piece of code takes about 40 minutes to run. How can I improve it?
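A likely culprit in the loop above is the assignment to %tempHash: it copies an entire per-document hash for every (token, document) pair, which is roughly 5600 x 1500 = 8.4 million hash copies. Below is a minimal sketch of one way to avoid that, not a definitive fix. It walks each document's tokens once through the stored reference instead of probing every document for every token. The sample data and the name %doc_tf (standing in for %hash2) are hypothetical; %mh keeps the layout used above.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for %hash2:
# filename => { token => frequency in that file }
my %doc_tf = (
    'doc1.txt' => { air => 5, water => 6 },
    'doc2.txt' => { orange => 2, air => 4, soup => 10 },
);

# Same layout as %mh in the post:
#   $mh{$token}{docf} = number of documents containing the token
#   $mh{$token}{$doc} = frequency of the token in $doc
my %mh;

for my $doc (sort keys %doc_tf) {
    my $tf = $doc_tf{$doc};        # keep the reference, no hash copy
    for my $token (keys %$tf) {    # visit only tokens this document has
        $mh{$token}{docf}++;
        $mh{$token}{$doc} = $tf->{$token};
    }
}

# Print the document frequency of every token.
for my $token (sort keys %mh) {
    printf "%-8s docf=%d\n", $token, $mh{$token}{docf};
}
```

The work done is proportional to the total number of (document, token) entries rather than to tokens times documents. If the original loop order has to stay, replacing the copy with a reference (my $tf = $hash2{$doc}; then exists $tf->{$token}) should remove most of the cost on its own. Note that a file literally named docf would collide with the docf key in this layout.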
Thanks,
Stan