Hi, I've been trying my best to compute term weights on stems rather than on raw words, using Lingua::Stem::En. However, I still can't solve it, because I don't know how to calculate the term weight of each stem or how to find the highest term weight. Could someone look at my script and give me some advice? Thanks. The term weight formula is:
weight(i,j) = (frequency of term i in document j) * log( (number of documents in the collection) / (number of documents term i occurs in) )
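To make the formula concrete, here is a minimal sketch of it as a Perl sub. The name term_weight and the sample numbers are made up for illustration, and it assumes the natural log (Perl's built-in log); a different base would only rescale every weight by the same factor.

```perl
use strict ;
use warnings ;

# weight(i,j) = tf(i,j) * log(N / df(i))
#   $tf : how often the term occurs in document j
#   $n  : number of documents in the collection
#   $df : number of documents that contain the term
# (term_weight is a hypothetical helper name, not part of any module)
sub term_weight {
    my ($tf, $n, $df) = @_ ;
    return 0 unless $tf && $df ;    # term absent: weight is zero
    return $tf * log($n / $df) ;
}

# A stem seen 3 times in one document and present in 2 of 5 documents:
printf "%.4f\n", term_weight(3, 5, 2) ;

# A stem present in all 5 documents gets weight 0, because log(5/5) = 0:
printf "%.4f\n", term_weight(7, 5, 5) ;
```

Note the consequence of the second call: a stem that occurs in every document contributes nothing, however frequent it is.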
Here's my script:

#!/usr/local/bin/perl -w

use warnings ;
use strict ;
use lib qw(.);
use Lingua::Stem::En;


my($file1, $file2, $file3, $file4, $file5, $file6) = @ARGV ;

local $/ = undef ;

my @D1words = () ;
my @D2words = () ;
my @D3words = () ;
my @D4words = () ;
my @D5words = () ;
my @STOPWORDS = () ;

my $D1 = 0 ;
my $D2 = 0 ;
my $D3 = 0 ;
my $D4 = 0 ;
my $D5 = 0 ;
my $STOP = 0 ;

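# $/ is undef, so each <FILE> read below returns the whole file at once.
# Splitting on /\W+/ (not /\W/) and dropping empty fields avoids
# collecting empty strings as "words".
open FILE1, '<', $file1 or die "Cannot open $file1: $!" ;
while (<FILE1>) {
    $D1 = lc $_ ;
    push @D1words, grep { length } split /\W+/, $D1 ;
}
close FILE1 ;

open FILE2, '<', $file2 or die "Cannot open $file2: $!" ;
while (<FILE2>) {
    $D2 = lc $_ ;
    push @D2words, grep { length } split /\W+/, $D2 ;
}
close FILE2 ;

open FILE3, '<', $file3 or die "Cannot open $file3: $!" ;
while (<FILE3>) {
    $D3 = lc $_ ;
    push @D3words, grep { length } split /\W+/, $D3 ;
}
close FILE3 ;

open FILE4, '<', $file4 or die "Cannot open $file4: $!" ;
while (<FILE4>) {
    $D4 = lc $_ ;
    push @D4words, grep { length } split /\W+/, $D4 ;
}
close FILE4 ;

open FILE5, '<', $file5 or die "Cannot open $file5: $!" ;
while (<FILE5>) {
    $D5 = lc $_ ;
    push @D5words, grep { length } split /\W+/, $D5 ;
}
close FILE5 ;

# The sixth file is the stop list.
open FILE6, '<', $file6 or die "Cannot open $file6: $!" ;
while (<FILE6>) {
    $STOP = lc $_ ;
    push @STOPWORDS, grep { length } split /\W+/, $STOP ;
}
close FILE6 ;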



# Build a lookup hash from the stop list. (The original loops tested
# %STOPWORDS without ever filling it, and ran s/// against $_, which
# was never set, so no word was actually removed.)
my %STOPWORDS = map { $_ => 1 } @STOPWORDS ;

# Keep only the words that are not in the stop list.
@D1words = grep { !$STOPWORDS{$_} } @D1words ;
@D2words = grep { !$STOPWORDS{$_} } @D2words ;
@D3words = grep { !$STOPWORDS{$_} } @D3words ;
@D4words = grep { !$STOPWORDS{$_} } @D4words ;
@D5words = grep { !$STOPWORDS{$_} } @D5words ;


# Scalars used as loop variables further down.
my $stemmed_D1words ;
my $stemmed_D2words ;
my $stemmed_D3words ;
my $stemmed_D4words ;
my $stemmed_D5words ;
my %exceptions = () ;

# stem() takes a reference to the list of words to stem and returns a
# reference to the list of stems, so pass in the filtered word lists
# (not the still-empty stem arrays) and dereference the result.
my @stemmed_D1words = @{ Lingua::Stem::En::stem(
{ -words => \@D1words,
  -locale => 'en',
  -exceptions => \%exceptions,
}) } ;
my @stemmed_D2words = @{ Lingua::Stem::En::stem(
{ -words => \@D2words,
  -locale => 'en',
  -exceptions => \%exceptions,
}) } ;
my @stemmed_D3words = @{ Lingua::Stem::En::stem(
{ -words => \@D3words,
  -locale => 'en',
  -exceptions => \%exceptions,
}) } ;
my @stemmed_D4words = @{ Lingua::Stem::En::stem(
{ -words => \@D4words,
  -locale => 'en',
  -exceptions => \%exceptions,
}) } ;
my @stemmed_D5words = @{ Lingua::Stem::En::stem(
{ -words => \@D5words,
  -locale => 'en',
  -exceptions => \%exceptions,
}) } ;


my $stemmed_D1count = 0 ;
my $stemmed_D2count = 0 ;
my $stemmed_D3count = 0 ;
my $stemmed_D4count = 0 ;
my $stemmed_D5count = 0 ;

my $stemmed_D1frequency = () ;
my $stemmed_D2frequency = () ;
my $stemmed_D3frequency = () ;
my $stemmed_D4frequency = () ;
my $stemmed_D5frequency = () ;

my %stemmed_D1frequency = () ;
my %stemmed_D2frequency = () ;
my %stemmed_D3frequency = () ;
my %stemmed_D4frequency = () ;
my %stemmed_D5frequency = () ;

# Count the stems in each document. Using ++ avoids the
# "uninitialized value" warning the first time a stem is seen.
foreach $stemmed_D1words ( @stemmed_D1words ) {
    $stemmed_D1count++ ;
    $stemmed_D1frequency{$stemmed_D1words}++ ;
}
foreach $stemmed_D2words ( @stemmed_D2words ) {
    $stemmed_D2count++ ;
    $stemmed_D2frequency{$stemmed_D2words}++ ;
}
# This loop originally iterated over the D1 stems by mistake.
foreach $stemmed_D3words ( @stemmed_D3words ) {
    $stemmed_D3count++ ;
    $stemmed_D3frequency{$stemmed_D3words}++ ;
}
foreach $stemmed_D4words ( @stemmed_D4words ) {
    $stemmed_D4count++ ;
    $stemmed_D4frequency{$stemmed_D4words}++ ;
}
foreach $stemmed_D5words ( @stemmed_D5words ) {
    $stemmed_D5count++ ;
    $stemmed_D5frequency{$stemmed_D5words}++ ;
}

What I did above: I load the five files plus a stop list, filter the stop words out of each document, use Lingua::Stem::En to reduce the remaining words to stems, and then count the frequency of each stem.
I'm stuck at this point, since I still need to compute the term weight of each stem and find the highest term weight. Thanks.
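
One possible way to get from per-document stem counts to the weights, as a minimal sketch rather than a finished answer. It assumes the five frequency hashes are collected into one array of hash refs (@freq is a made-up name, filled here with invented counts so the snippet runs by itself) and that log means the natural log.

```perl
use strict ;
use warnings ;

# Hypothetical sample data: one hash of stem => count per document.
# In the script above these would be references to
# %stemmed_D1frequency .. %stemmed_D5frequency.
my @freq = (
    { perl => 3, regex  => 1 },
    { perl => 1, stem   => 2 },
    { stem => 1, weight => 4 },
) ;

my $n_docs = scalar @freq ;

# Document frequency: in how many documents does each stem occur?
my %df ;
foreach my $doc (@freq) {
    $df{$_}++ foreach keys %$doc ;
}

# weight(i,j) = tf(i,j) * log(N / df(i)), remembering the largest one seen.
my @weight ;
my ($best_stem, $best_doc, $best_weight) = (undef, undef, -1) ;
foreach my $j (0 .. $#freq) {
    $weight[$j] = {} ;
    foreach my $stem (sort keys %{ $freq[$j] }) {
        my $w = $freq[$j]{$stem} * log($n_docs / $df{$stem}) ;
        $weight[$j]{$stem} = $w ;
        if ($w > $best_weight) {
            ($best_stem, $best_doc, $best_weight) = ($stem, $j + 1, $w) ;
        }
    }
}

# Print every weight, then the highest one.
foreach my $j (0 .. $#weight) {
    printf "D%d %-8s %.4f\n", $j + 1, $_, $weight[$j]{$_}
        foreach sort keys %{ $weight[$j] } ;
}
printf "highest: %s in D%d (%.4f)\n", $best_stem, $best_doc, $best_weight ;
```

Keeping the documents in an array instead of five separately named hashes means the same two loops work for any number of files.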

Edited 2003-03-06 by mirod: added code tags

In reply to term weight by Anonymous Monk
