
Hi, I am having trouble computing term weights. This is part of a document-similarity task. I have the similarity part working now, but I do not know how to find the term weights. I need to take 5 files and find the term weight of each word in each document. The program must also filter words through the stoplist and calculate the term weight on stems (as determined by Lingua::Stem::En) instead of on words.
So far I cannot even find the most frequent word in each file, and I am stuck. Can anyone help? Thanks.

#! /usr/local/bin/perl -w
use strict;
use lib qw(.);
use Lingua::Stem::En;

my $stopfile = 'stopwords';
my $base= shift @ARGV;


open my $stop_fh, '<', $stopfile or die "Cannot open $stopfile: $!";
chomp( my @stop = <$stop_fh> );
close $stop_fh;

my %stopwords=();
# add the empty string '' to the stopwords as well
@stopwords{@stop,''} = ();

# read in basefile
my %D1=();
my $result='';
my $top=0;
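open my $base_fh, '<', $base or die "Cannot open $base: $!";
while ( <$base_fh> ) {
   # lowercase each word, drop stopwords, and record the rest as keys of %D1
   @D1{ map { my $l = lc; exists $stopwords{$l}?():$l } split /\W+/ } = ();
}
close $base_fh;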


my %D2=();
my %frequency=();   # word counts for the current file, reset at eof

while ( <> ) {
   # lowercase each word on this line and drop stopwords
   my @words = map { my $l = lc; exists $stopwords{$l}?():$l } split /\W+/;
   @D2{@words} = ();                  # set of distinct words in this file
   $frequency{$_}++ foreach @words;   # accumulate counts across all lines
   #print "$_\n" foreach @words;
}
continue {
   if (eof) {
      # most frequent word in the file that just ended
      foreach my $word (keys %frequency) {
         if ( $frequency{$word} > $top ) {
            $result = $word;
            $top    = $frequency{$word};
         }
      }
      print "file $ARGV testing: $result\n";
      print "number of times: $top\n";

      # Dice coefficient: 2 * |D1 and D2 in common| / (|D1| + |D2|)
      my $total = (scalar keys %D1) + (scalar keys %D2);
      my $intersect = 0;

      foreach my $key (keys %D1) {
         $intersect++ if exists $D2{$key};
      }
      my $similarity = $total ? 2*$intersect/$total : 0;
      print "Similarity between $base and $ARGV = $similarity\n";
      #print "\t@{[keys %D2]}\n";
      #print "\t@{[keys %D1]}\n";

      # reset per-file state before the next file
      %D2 = ();
      %frequency = ();
      $result = '';
      $top = 0;
   }
}
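
For the term weight itself, here is a minimal sketch under one assumption: that "term weight" means TF-IDF (term frequency times log of total documents over documents containing the term). The post does not say which definition the task requires, so treat the formula as a guess. The helper names (tokenize, stem_words, term_freq, tf_idf) are made up for illustration. Stemming goes through Lingua::Stem::En::stem when the module is installed and otherwise passes the words through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lowercase, split on non-word characters, drop stopwords.
sub tokenize {
    my ($text, $stop) = @_;
    return grep { length($_) && !exists $stop->{$_} } split /\W+/, lc $text;
}

# Stem with Lingua::Stem::En if it is installed; otherwise return words as-is.
sub stem_words {
    my @words = @_;
    return @words unless eval { require Lingua::Stem::En; 1 };
    my $stems = Lingua::Stem::En::stem({ -words => \@words, -locale => 'en' });
    return @$stems;
}

# Term frequency: how often each term occurs in one document.
sub term_freq {
    my %tf;
    $tf{$_}++ for @_;
    return \%tf;
}

# Assumed weighting: tf * log(N / df), where N is the number of documents
# and df is the number of documents that contain the term.
sub tf_idf {
    my ($tf_by_doc) = @_;              # { doc name => { term => count } }
    my $n = keys %$tf_by_doc;
    my %df;
    for my $tf (values %$tf_by_doc) {
        $df{$_}++ for keys %$tf;
    }
    my %weight;
    for my $doc (keys %$tf_by_doc) {
        my $tf = $tf_by_doc->{$doc};
        $weight{$doc}{$_} = $tf->{$_} * log($n / $df{$_}) for keys %$tf;
    }
    return \%weight;
}

# Usage: perl termweight.pl stopwords doc1 doc2 ... (file names are examples)
if (@ARGV >= 2) {
    my ($stopfile, @docs) = @ARGV;
    open my $sfh, '<', $stopfile or die "Cannot open $stopfile: $!";
    chomp(my @stop = <$sfh>);
    close $sfh;
    my %stop;
    @stop{@stop} = ();

    my %tf_by_doc;
    for my $doc (@docs) {
        open my $fh, '<', $doc or die "Cannot open $doc: $!";
        my $text = do { local $/; <$fh> };   # slurp the whole file
        close $fh;
        $text = '' unless defined $text;
        $tf_by_doc{$doc} = term_freq(stem_words(tokenize($text, \%stop)));
    }

    my $w = tf_idf(\%tf_by_doc);
    for my $doc (sort keys %$w) {
        my $dw = $w->{$doc};
        for my $term (sort { $dw->{$b} <=> $dw->{$a} || $a cmp $b } keys %$dw) {
            printf "%s\t%s\t%.4f\n", $doc, $term, $dw->{$term};
        }
    }
}
```

Note that with this formula a stem that occurs in every document gets weight 0, since log(N/N) is 0. If the task defines term weight differently (for example, raw frequency divided by the frequency of the most common term), only tf_idf would need to change.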

Edited 2003-03-05 by mirod: added <code> tags

In reply to Term weight by Anonymous Monk
