stan131 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I need to write my own implementation of the vector space model for searching through a given set of documents. I know there are ready-made scripts available, but I need to write my own. The code I wrote works, but it takes almost 40 minutes to run for a total token count of 5,600 across 1,500 documents. 99% of the time is spent building the hash of document and term frequencies for each token. I need help with how I can improve the performance. Here is what I have:
1) A list of documents with stopwords removed and tokens stemmed.
2) I loop through all the documents to create a master hash of all the tokens, and in the same loop create a hash of hashes for each document, holding that document's tokens and their frequencies.
e.g.
%hash1 = ( air => 5, water => 6 );   # token => frequency for one document
$hash2{$filename} = \%hash1;         # e.g. %hash2 = ( 1 => { air => 5, water => 6 }, 2 => { orange => 2, air => 4, soup => 10 } )

The above runs for each file in the directory, building %hash2 with the filename as key and, as value, a reference to %hash1 holding that file's tokens and their frequencies.
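Roughly, the building loop looks like this -- a simplified sketch, assuming one preprocessed document per file with whitespace-separated tokens (the open/loop details here are illustrative, not my exact code):

my %hash2;       # filename => { token => frequency }
my %mstrToken;   # master hash of every token seen
for my $file (@docNames) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    my %freq;
    while ( my $line = <$fh> ) {
        $freq{$_}++ for split ' ', $line;    # count each stemmed token
    }
    close $fh;
    $hash2{$file} = \%freq;                  # this file's token-frequency hash
    $mstrToken{$_}++ for keys %freq;         # merge tokens into the master hash
}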
3) For each token in the master token hash (about 5,600 tokens), search all the documents:
for $a (sort keys %mstrToken) {
    $df = 1;
    foreach $doc (@docNames) {
        %tempHash = %{ $hash2{$doc} };
        if ( exists $tempHash{$a} ) {
            $tkfreq = $tempHash{$a};
            $mh{$a}->{'docf'} = $df++;
            $mh{$a}->{$doc}   = $tkfreq;
        }
    }    # end of file processing for loop
}
This piece of code takes about 40 minutes to run. How can I improve it?
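(For context, in case it matters: the document frequency in docf and the per-document term frequencies are the two ingredients of the usual tf-idf weight of the vector space model, w(t,d) = tf(t,d) * log(N/df(t)) with N the total number of documents, which is why I need both numbers for every token.)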
Thanks, Stan

Replies are listed 'Best First'.
Re: hashes performance issue
by BrowserUk (Patriarch) on Mar 29, 2009 at 01:42 UTC

    It's not surprising that your loop is taking a long time. You're duplicating every hash in your HoHs, for every token in your token table. And it is completely unnecessary.

    This will probably run an order (or two) of magnitude faster:

    for $a (sort keys %mstrToken) {
        $df = 1;
        foreach $doc ( @docNames ) {
            if( exists $hash2{ $doc }{ $a } ) {
                $mh{$a}->{ docf } = $df++;
                $mh{$a}->{ $doc } = $hash2{ $doc }{ $a };
            }
        }
    }
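    If you want to see just how much the copying costs, here is a quick, illustrative Benchmark comparison -- the data shape and sizes below are made up; only the copy-versus-direct access pattern matters:

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my %hash2;
    for my $d ( 1 .. 100 ) {                        # 100 fake documents
        $hash2{"doc$d"}{"tok$_"} = 1 for 1 .. 500;  # 500 fake tokens each
    }

    cmpthese( -3, {
        copy => sub {
            for my $doc ( keys %hash2 ) {
                my %tempHash = %{ $hash2{$doc} };   # copies 500 pairs every pass
                my $x = $tempHash{tok1};
            }
        },
        direct => sub {
            for my $doc ( keys %hash2 ) {
                my $x = $hash2{$doc}{tok1};         # indexes in place, no copy
            }
        },
    } );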

    That said, you really ought to consider using more descriptive names for your variables. If $a was $token, things would be much clearer. And %hash2? Is that the second hash in this program? Or the second one this week; decade; century?

    While you're at it, move to use strict;. It's not obligatory, but you'll be glad you did in the long term.
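    For example, under strict (plus warnings) a typo in a variable name becomes a compile-time error instead of a silently created global:

    use strict;
    use warnings;

    my $tkfreq = 5;
    print $tkfrq;   # dies at compile time:
                    # Global symbol "$tkfrq" requires explicit package name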


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks BrowserUk. This reduced the run time by a huge margin. Now it's taking only about 22 seconds. This was the first time I used hash references, but yes, I did realize the mistake.
      I agree with your advice on naming conventions and will start using strict;.
      I am still a newbie with Perl and am learning all the best practices.
      Thanks,
      Stan
Re: hashes performance issue
by Your Mother (Archbishop) on Mar 29, 2009 at 00:57 UTC
Re: hashes performance issue
by Bloodnok (Vicar) on Mar 29, 2009 at 00:36 UTC
    You will get better performance if the Perl installation (incl. the libraries) and the files are all on the same host - that way you avoid network bottlenecks ... but you will, in all probability, still run into I/O bottlenecks. You don't identify the OS, but in my experience, networking involving Windoze boxes is invariably slower.

    A user level that continues to overstate my experience :-))
      Bloodnok,
      I am using PuTTY on my Windows PC, using SSH to connect to a Linux server, and running the code there. The whole Perl installation (incl. libraries) is on the Linux server.
      Thanks, Stan