OK, you may need to adjust the definition of $total, as noted in the code comments; otherwise the following program works as I understood your problem...

#! /usr/local/bin/perl -w

use strict;

my $stopfile = 'stopwords';
my %stoplist;

# fill stop word list, assuming each word is on one line
open STOP, "<$stopfile" or die "cannot open $stopfile: $!\n";
while ( defined (my $stop = <STOP>) )
{
  chomp $stop;
  $stoplist{$stop} = 1;
}
close STOP or die "cannot close $stopfile: $!\n";

# FIRST file contains the words to compare against;
# get the target word list
#
my @target = @{ filter( \%stoplist , [ shift @ARGV ] ) };

# rest of the files contain words which we want
# to compare against the target list
#
my @words = @{ filter( \%stoplist , \@ARGV ) };

# adjust as desired, as I fail to see what @D1 is (in OP) and
# why $total needs to be twice the size of @D1
#
# BELOW IS MY NOTION OF $total
#
my $total = scalar @target + scalar @words;

my $similarity =
  2 * ( scalar @{ intersect( \@target , \@words ) } / $total );

# display similarity to 4 significant digits
printf "\nsimilarity is: %0.4g\n\n", $similarity;

# find intersection of two arrays: 1st contains all the interesting values,
# 2nd both interesting & uninteresting
sub intersect
{
  my ($ref , $misc) = @_;

  my %intersection;
  foreach my $misc ( @{$misc} )
  {
    foreach my $ref ( @{$ref} )
    {
      next if $misc ne $ref;
      $intersection{$ref} = 1;
    }
  }
  return [ keys %intersection ];
}

# given a stop word hash & file name array (consisting of input word lists),
# return the list of words that are not stop words
sub filter
{
  my ($stop , $files) = @_;

  my %filtered;
  foreach my $file ( @{$files} )
  {
    open FH , "<$file" or die "cannot open $file to read: $!\n";
    while ( defined (my $line = <FH>) )
    {
      foreach my $word (@{ line2words( $line ) })
      {
        next if $stop->{$word};
        $filtered{$word} = 1;
      }
    }
    close FH or die "cannot close $file: $!\n";
  }
  return [ keys %filtered ];
}

# return words, lower cased, from a given line
sub line2words
{
  my $line = $_[0];
  return [ map { lc $_ } grep { $_ ne '' } split /\W+/ , $line ];
}


Update: Added the missing die for when STOP cannot be closed.
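
For what it's worth, the nested loops in intersect compare every pair of words, which gets slow on long lists. Below is a minimal sketch of the same calculation using hash lookups instead. The name dice_similarity is made up for illustration (it is not part of the program above), and it keeps my notion of $total (the sum of the two de-duplicated list sizes), which may not be what the OP wants. Returning 0 for two empty lists is also an assumption, made only to avoid dividing by zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the program above): similarity of two
# word lists, 2 * |common| / ( |target| + |words| ), via hash lookups.
sub dice_similarity {
    my ($target, $words) = @_;

    # de-duplicate both lists
    my %in_target = map { $_ => 1 } @{$target};
    my %in_words  = map { $_ => 1 } @{$words};

    # same notion of $total as above: sum of the two list sizes
    my $total = keys(%in_target) + keys(%in_words);
    return 0 if $total == 0;    # assumption: two empty lists score 0

    # grep in scalar context counts the words present in both lists
    my $common = grep { $in_target{$_} } keys %in_words;

    return 2 * $common / $total;
}

# two words in common, 3 + 4 words in total: 2 * 2 / 7
printf "%.4f\n",
    dice_similarity( [qw(apple pear plum)], [qw(pear plum fig kiwi)] );
```

With the sample lists this prints 0.5714. The stop word filtering would still happen beforehand, for example through filter as in the program above.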


In reply to Re: Re: Re: Calculating "similarity" by parv
in thread Calculating "similarity" by Anonymous Monk
