Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying my best to compute the term weight on stems rather than on plain words, using Lingua::Stem::En. However, I still cannot solve it, because I do not know how to calculate the term weight and the highest term weight. Can someone look at my script and give some opinion? Thanks. The term weight formula is:
weight(i,j) = (frequency of term i in document j) * log( number of documents in the collection / number of documents in which term i occurs )
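For example (numbers made up purely to illustrate the formula): if term i appears 3 times in document j, the collection holds 5 documents, and term i occurs in 2 of them, then:

# illustration only -- the numbers are invented
my $weight = 3 * log( 5 / 2 );    # log() is the natural log in Perl
printf "%.2f\n", $weight;         # prints 2.75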
here's my script:
#!/usr/local/bin/perl -w
use warnings;
use strict;
use lib qw(.);
use Lingua::Stem::En;

my ( $file1, $file2, $file3, $file4, $file5, $file6 ) = @ARGV;
local $/ = undef;

my @D1words = ();  my @D2words = ();  my @D3words = ();  my @D4words = ();  my @D5words = ();  my @STOPWORDS = ();
my $D1 = 0;  my $D2 = 0;  my $D3 = 0;  my $D4 = 0;  my $D5 = 0;  my $STOP = 0;

open FILE1, "<$file1";
while (<FILE1>) { $D1 = lc $_; @D1words = ( @D1words, split /\W/, $D1 ); }
close FILE1;

open FILE2, "<$file2";
while (<FILE2>) { $D2 = lc $_; @D2words = ( @D2words, split /\W/, $D2 ); }
close FILE2;

open FILE3, "<$file3";
while (<FILE3>) { $D3 = lc $_; @D3words = ( @D3words, split /\W/, $D3 ); }
close FILE3;

open FILE4, "<$file4";
while (<FILE4>) { $D4 = lc $_; @D4words = ( @D4words, split /\W/, $D4 ); }
close FILE4;

open FILE5, "<$file5";
while (<FILE5>) { $D5 = lc $_; @D5words = ( @D5words, split /\W/, $D5 ); }
close FILE5;

open FILE6, "<$file6";
while (<FILE6>) { $STOP = lc $_; @STOPWORDS = ( @STOPWORDS, split /\W/, $STOP ); }
close FILE6;

my $STOPWORDS = ();  my $D1words = ();  my $D2words = ();  my $D3words = ();  my $D4words = ();  my $D5words = ();
my %STOPWORDS = ();
my %D1frequency = ();  my %D2frequency = ();  my %D3frequency = ();  my %D4frequency = ();  my %D5frequency = ();

foreach $STOPWORDS (@STOPWORDS) {
    foreach $D1words (@D1words) { (s/$D1words/ /i) if $STOPWORDS{$D1words}; }
    foreach $D2words (@D2words) { (s/$D2words/ /i) if $STOPWORDS{$D2words}; }
    foreach $D3words (@D3words) { (s/$D3words/ /i) if $STOPWORDS{$D3words}; }
    foreach $D4words (@D4words) { (s/$D4words/ /i) if $STOPWORDS{$D4words}; }
    foreach $D5words (@D5words) { (s/$D5words/ /i) if $STOPWORDS{$D5words}; }
}

my $stemmed_D1words = ();  my $stemmed_D2words = ();  my $stemmed_D3words = ();  my $stemmed_D4words = ();  my $stemmed_D5words = ();
my %exceptions = ();
my @stemmed_D1words = ();  my @stemmed_D2words = ();  my @stemmed_D3words = ();  my @stemmed_D4words = ();  my @stemmed_D5words = ();

@stemmed_D1words = @{ Lingua::Stem::En::stem( { -words => \@D1words, -locale => 'en', -exceptions => \%exceptions } ) };
@stemmed_D2words = @{ Lingua::Stem::En::stem( { -words => \@D2words, -locale => 'en', -exceptions => \%exceptions } ) };
@stemmed_D3words = @{ Lingua::Stem::En::stem( { -words => \@D3words, -locale => 'en', -exceptions => \%exceptions } ) };
@stemmed_D4words = @{ Lingua::Stem::En::stem( { -words => \@D4words, -locale => 'en', -exceptions => \%exceptions } ) };
@stemmed_D5words = @{ Lingua::Stem::En::stem( { -words => \@D5words, -locale => 'en', -exceptions => \%exceptions } ) };

my $stemmed_D1count = 0;  my $stemmed_D2count = 0;  my $stemmed_D3count = 0;  my $stemmed_D4count = 0;  my $stemmed_D5count = 0;
my $stemmed_D1frequency = ();  my $stemmed_D2frequency = ();  my $stemmed_D3frequency = ();  my $stemmed_D4frequency = ();  my $stemmed_D5frequency = ();
my %stemmed_D1frequency = ();  my %stemmed_D2frequency = ();  my %stemmed_D3frequency = ();  my %stemmed_D4frequency = ();  my %stemmed_D5frequency = ();

foreach $stemmed_D1words (@stemmed_D1words) {
    $stemmed_D1count = $stemmed_D1count + 1;
    $stemmed_D1frequency{$stemmed_D1words} = $stemmed_D1frequency{$stemmed_D1words} + 1;
}
foreach $stemmed_D2words (@stemmed_D2words) {
    $stemmed_D2count = $stemmed_D2count + 1;
    $stemmed_D2frequency{$stemmed_D2words} = $stemmed_D2frequency{$stemmed_D2words} + 1;
}
foreach $stemmed_D3words (@stemmed_D3words) {
    $stemmed_D3count = $stemmed_D3count + 1;
    $stemmed_D3frequency{$stemmed_D3words} = $stemmed_D3frequency{$stemmed_D3words} + 1;
}
foreach $stemmed_D4words (@stemmed_D4words) {
    $stemmed_D4count = $stemmed_D4count + 1;
    $stemmed_D4frequency{$stemmed_D4words} = $stemmed_D4frequency{$stemmed_D4words} + 1;
}
foreach $stemmed_D5words (@stemmed_D5words) {
    $stemmed_D5count = $stemmed_D5count + 1;
    $stemmed_D5frequency{$stemmed_D5words} = $stemmed_D5frequency{$stemmed_D5words} + 1;
}
What I did above: I load in the five files plus a stoplist, then I filter out the stopwords, then I use Lingua::Stem::En to turn the words into stems, and then I find the frequency of the stems.
Now I am stuck, since I need to find the term weight of each stem and the highest term weight. Thanks.
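I guess what I need is something like the following (the %freq data is just made-up sample input standing in for my %stemmed_D1frequency .. %stemmed_D5frequency hashes), but I am not sure it is right:

use strict;
use warnings;

# made-up sample data: stem frequencies per document
my %freq = (
    doc1 => { retriev => 3, index  => 1 },
    doc2 => { retriev => 1, weight => 2 },
);
my $num_docs = scalar keys %freq;

# document frequency: in how many documents does each stem occur?
my %df;
for my $doc ( keys %freq ) {
    $df{$_}++ for keys %{ $freq{$doc} };
}

# weight(i,j) = freq(i,j) * log( N / df(i) ), tracking the highest weight seen
my %weight;
my ( $best_stem, $best_doc, $best ) = ( '', '', -1 );
for my $doc ( keys %freq ) {
    for my $stem ( keys %{ $freq{$doc} } ) {
        my $w = $freq{$doc}{$stem} * log( $num_docs / $df{$stem} );
        $weight{$doc}{$stem} = $w;
        ( $best_stem, $best_doc, $best ) = ( $stem, $doc, $w ) if $w > $best;
    }
}
printf "highest weight: %s in %s (%.3f)\n", $best_stem, $best_doc, $best;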

Edited 2003-03-06 by mirod: added code tags

Re: term weight
by rob_au (Abbot) on Mar 06, 2003 at 11:15 UTC
    The following is a section of code from an indexing and search engine which I have written; it should help you get up and running.

    # Step through each term found in the parsed content and proceed with
    # term indexing
    foreach my $term ( split /\s+/, $self->{'_content'} ) {

        # Normalise the search term, allowing only characters in the range of
        # a-z, A-Z, digits and the underscore character. All terms are then
        # dropped to lowercase to improve the likelihood of matching search
        # results.
        $term = $self->_normalise( $term );
        next unless length $term;

        my ( $stem ) = @{ Lingua::Stem::stem( $term ) };

        # Increment the frequency counters for the stemmed term - The
        # _index_count is the count of the number of documents which the
        # stemmed term appears in (not the total count of all appearances of
        # the stemmed term in all documents) while _index_frequency is the
        # number of occurrences of the stemmed term in the current document
        # indexed by $url
        #
        # The hash _index_stem is important to prevent duplicate document
        # counting for documents which may have a stemmed term appear more
        # than once.
        ++${$self->{'_index_count'}}{$stem} unless ${$self->{'_index_stem'}}{$stem}++;
        ++${${$self->{'_index_frequency'}}{$stem}}{$url};
    }

    The term index weights are subsequently calculated based upon these stem and per-page counts. The only additional variable in this subroutine which needs explanation is the _crawl_visited hash reference - This hash, indexed by content source URI, stores meta information about the content.

    sub weights {
        my ( $self ) = @_;

        # Step through each stemmed term indexed
        foreach my $stem ( keys %{$self->{'_index_count'}} ) {

            # Step through each document in which the stemmed term $stem
            # appears, calculate its weight and store this ranking in the
            # %weights hash.
            my %weights;
            foreach my $url ( keys %{${$self->{'_index_frequency'}}{$stem}} ) {
                $weights{$url} = sprintf "%.2f",
                    ${${$self->{'_index_frequency'}}{$stem}}{$url} *
                    log( ( scalar keys %{$self->{'_crawl_visited'}} ) / $self->{'_index_count'}->{$stem} );
            }

            # Store ranking score in tied hash - Note the fashion by which the
            # hash reference is built first and then assigned to the MLDBM-tied
            # hash. This is required due to the limitations of the Perl TIEHASH
            # interface which has no support for multi-dimensional ties.
            ${$self->{'_tied_weight'}}{$stem} = \%weights;
        }
    }
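    Once the weights have been stored, ranking documents for a query term is just a matter of sorting that per-stem hash. A rough sketch (assuming a single search term in $query and the structures built above):

    my ( $stem ) = @{ Lingua::Stem::stem( $query ) };
    my $weights  = ${$self->{'_tied_weight'}}{$stem} || {};
    foreach my $url ( sort { $weights->{$b} <=> $weights->{$a} } keys %$weights ) {
        print "$weights->{$url}\t$url\n";
    }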

    As outlined in this node, the Perlfect search engine, which is written in Perl, may also prove to be a useful reference.

     

    perl -le 'print+unpack("N",pack("B32","00000000000000000000001000111010"))'

Re: term weight
by Hofmator (Curate) on Mar 06, 2003 at 10:30 UTC
    Without even waiting till the code tags are added, I see a red flag waving at me ;-)

    It's never right to use a set of variables named stemmed_D1words, stemmed_D2words, ... You should read up on arrays of arrays, starting e.g. with perlreftut or perldsc.
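    Just to sketch the idea (untested, and with the stoplist handling left out), a single hash of hashes lets you handle any number of documents with one loop:

    use strict;
    use warnings;
    use Lingua::Stem::En;

    my @files = @ARGV;      # as many documents as you like
    my %frequency;          # $frequency{$file}{$stem} = count

    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my @words = grep { length } map { split /\W+/, lc } <$fh>;
        close $fh;

        my $stems = Lingua::Stem::En::stem( { -words => \@words, -locale => 'en' } );
        $frequency{$file}{$_}++ for @$stems;
    }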

    Update: Fixed the link to perlreftut to point directly to www.perldoc.com. Why don't we have a node with that name here on PM like we have for perldsc?

    -- Hofmator