Hi, I'm having some problems writing a Perl program that calculates similarity. I hope someone can help. Thanks.

Similarity is calculated with a formula like this:

Similarity = 2 x (intersection / total)

I tried to solve the problem, but I'm stuck in the middle. The program has to load a stoplist and filter the stoplist words out of each file before calculating similarity on the remaining words. The main point is to take one file and compare it with each of the other files.

However, I don't know how to convert between a hash and an array (in either direction), so I am stuck.
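For reference, the hash/array conversions mentioned above can be sketched as follows. This is only a minimal illustration; the word lists and variable names (`@words`, `%seen`, `%other`) are made up for the example and are not taken from the script.

```perl
use strict;
use warnings;

# Hypothetical sample data, just for illustration
my @words = qw(cat dog cat bird);

# Array -> hash: a hash slice marks every word as seen; duplicates collapse.
my %seen;
@seen{@words} = (1) x @words;

# The same conversion written with map.
my %seen_too = map { $_ => 1 } @words;

# Hash -> array: keys returns the distinct words.
# Key order is not guaranteed, so sort for stable output.
my @unique = sort keys %seen;

# Keep only the array elements that are also keys of another hash.
my %other  = map { $_ => 1 } qw(dog fish bird);
my @common = grep { $other{$_} } @unique;

print "unique: @unique\n";   # unique: bird cat dog
print "common: @common\n";   # common: bird dog
```

Using a hash as a set like this also removes duplicate words, which may or may not be what the similarity formula is meant to count.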

Here's my script. I hope someone can help me:

#!/usr/local/bin/perl -w
use strict;

my $stopfile = 'stopwords';
my $base     = shift @ARGV;
my @files    = @ARGV;
my %stopwords       = ();
my %basefilterwords = ();
my %filterwords     = ();
my @basewords;
my @words;

# Load the stoplist, one word per line
open STOP, "<$stopfile" or die "Cannot open $stopfile: $!";
while (my $stopword = <STOP>) {
    chomp $stopword;
    $stopwords{$stopword} = 1;
}
close STOP;

# Collect the words of the base file that are not in the stoplist
open BASETEXT, "<$base" or die "Cannot open $base: $!";
while (my $line = <BASETEXT>) {
    my @basewords = split /\W/, $line;
    foreach my $baseword (@basewords) {
        next if $baseword eq '';    # split /\W/ can yield empty strings
        $baseword = lc $baseword;
        $basefilterwords{$baseword} = 1 unless $stopwords{$baseword};
    }
}
close BASETEXT;

# Do the same for every other file
# (note: %filterwords accumulates the words of all files together)
foreach my $file (@files) {
    open TEXT, "<$file" or die "Cannot open $file: $!";
    while (my $line = <TEXT>) {
        my @words = split /\W/, $line;
        foreach my $word (@words) {
            next if $word eq '';
            $word = lc $word;
            $filterwords{$word} = 1 unless $stopwords{$word};
        }
    }
    close TEXT;
}

That's as far as I got. The script filters the words, but then I'm stuck, because I don't know how to change the next part to work with arrays. Here it is:

# $D1 and $D2 are assumed to hold the text of the two documents
my @D1 = map lc $_, $D1 =~ /(\w+)/g;
my @D2 = map lc $_, $D2 =~ /(\w+)/g;

my %D2 = ();
@D2{@D2} = (1) x scalar @D2;

# total number of words in both documents
my $total = scalar(@D1) + scalar(@D2);
my $intersection = 0;

# count the number of words in common
foreach my $word (@D1) {
    ++$intersection if $D2{$word};
}

my $similarity = 2 * ($intersection / $total);
print "\n$similarity\n\n";

I am sure this part needs some changes, but I really don't understand what to change. I hope someone can help me solve it. Thanks.
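One possible way to join the two parts, keeping everything in hashes so that no hash-to-array conversion is needed, is sketched below. It rests on assumptions that are not in the script above: "total" is taken to be the number of distinct filtered words in the base file plus the number in the other file, the stoplist path is passed as the first argument instead of being hard-coded, and the sub names `load_words` and `similarity` are made up for the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one file and return a hash ref whose keys are its distinct,
# lowercased words, minus anything present in the stoplist hash.
sub load_words {
    my ($path, $stop) = @_;
    my %words;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        for my $word (lc($line) =~ /\w+/g) {
            $words{$word} = 1 unless $stop->{$word};
        }
    }
    close $fh;
    return \%words;
}

# Similarity = 2 * intersection / total.
# Assumption: total = distinct words in A + distinct words in B.
sub similarity {
    my ($a_words, $b_words) = @_;
    my $total = keys(%$a_words) + keys(%$b_words);
    return 0 unless $total;    # avoid dividing by zero for two empty files
    my $intersection = grep { $b_words->{$_} } keys %$a_words;
    return 2 * $intersection / $total;
}

# Hypothetical usage: perl similarity.pl stopwords base.txt other1.txt other2.txt
if (@ARGV >= 2) {
    my ($stopfile, $base, @files) = @ARGV;
    my $stop       = load_words($stopfile, {});    # the stoplist itself is not filtered
    my $base_words = load_words($base, $stop);
    for my $file (@files) {
        my $sim = similarity($base_words, load_words($file, $stop));
        printf "%s vs %s: %.3f\n", $base, $file, $sim;
    }
}
```

Each file gets its own hash here, so the base file is compared with one file at a time. Because the words are stored as hash keys, repeated words count once; if repeated words should count, the hashes would need to hold counts instead of 1.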

In reply to Calculating "similarity" by Anonymous Monk
