http://qs1969.pair.com?node_id=1166649

bms9nmh has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have the following Perl script which compares words in the 2nd field of two CSV files, and if 5 or more words match then it prints both lines together as a positive match.

    #!/bin/perl

    my @csv2 = ();
    open CSV2, "<csv2" or die;
    @csv2=<CSV2>;
    close CSV2;

    my %csv2hash = ();
    for (@csv2) {
      chomp;
      my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
      $csv2hash{$_} = $title;
    }

    open CSV1, "<csv1" or die;
    while (<CSV1>) {
      chomp;
      my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
      my %words;
      $words{$_}++ for split /\s+/, $title; #/ get words
      ## Collect unique words
      my @titlewords = keys(%words);
      my @new;
      #add exception words which shouldn't be matched
      foreach my $t (@titlewords){
        push(@new, $t) if $t !~ /^(rare|vol|volume|issue|double|magazine|mag)$/i;
      }
      @titlewords = @new;
      my $desired = 5;
      my $matched = 0;
      foreach my $csv2 (keys %csv2hash) {
        my $count = 0;
        my $value = $csv2hash{$csv2};
        foreach my $word (@titlewords) {
          my @matches = ( $value=~/\b$word\b/ig );
          my $numIncsv2 = scalar(@matches);
          @matches = ( $title=~/\b$word\b/ig );
          my $numIncsv1 = scalar(@matches);
          ++$count if $value =~ /\b$word\b/i;
          if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
            $count = $desired+1;
            last;
          }
        }
        if ($count >= $desired) {
          print "$csv2\n";
          ++$matched;
        }
      }
      print "$_\n\n" if $matched;
    }
    close CSV1;

I would now like to add extra functionality: if both lines contain a year as a four-digit number (for example 1989), then the years have to match for the pair to be considered a positive match. However, if only one of the lines contains a year, the usual rule of 5 matching words should apply and the year becomes irrelevant. Some examples to clarify. In this example, field 2 contains 5 matching words but the two years (1973 and 2013) are different, so this would be discounted as a match:

    2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
    12278788, TV & SATELLITE WEEK 11 MAY 2013 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12

In this example the years are the same AND there are 5 or more matching words so this would be a positive match:

    2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
    12278788, TV & SATELLITE WEEK 11 MAY 1973 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12

In this example, only one of the titles contains a year (1973), but there are also 5 or more matching words, so I would like this to be considered a positive match:

    2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
    12278788, TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12

In this example, neither title contains a year, but 5 or more words match, so this would be a positive match:

    2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
    12278788, TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12

How can I add this functionality to the script without making wholesale changes?

Replies are listed 'Best First'.
Re: Additional "year" matching functionality in word matching script
by Corion (Patriarch) on Jun 27, 2016 at 11:18 UTC

    Consider extracting all the things that look like a year number, and changing the matching logic to first check for the equivalence of the years and then falling back on the five words equivalence. Something like:

        my $year_left = ...;
        my $year_right = ...;
        my $have_years = ($year_left and $year_right);
        my $equal_years = ($year_left == $year_right);
        my $five_words_result = ...; # you already have this above
        my $final_result;
        if( ! $have_years ) {
            $final_result = $five_words_result;
        } elsif( $equal_years ) {
            $final_result = $five_words_result;
        } elsif( ! $equal_years ) {
            $final_result = undef; # without matching years, things are never equal
        } else {
            die "This should never happen. left=$year_left, right=$year_right, have_years=$have_years, equal_years=$equal_years";
        };

    Maybe it would be good to put this logic, together with the logic for finding five or more matching words, into its own function. Consider passing to that function the title on the left side and the title on the right side, and possibly already the extracted year numbers.
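
    A minimal sketch of such a function, assuming the years have already been extracted and that a title without a year is represented by undef. The name final_result and its three-argument interface are illustrative only, not part of the original script. Testing with defined rather than plain truthiness also avoids "uninitialized value" warnings when a year is missing:

```perl
use strict;
use warnings;

# Hypothetical helper: combine the year rule with the five-word result.
# $year_left / $year_right are undef when a title contains no year.
sub final_result {
    my ( $year_left, $year_right, $five_words_result ) = @_;

    my $have_years = defined $year_left && defined $year_right;

    # Both titles carry a year and the years differ: never a match.
    return 0 if $have_years && $year_left != $year_right;

    # Otherwise the year is irrelevant and the word rule decides.
    return $five_words_result ? 1 : 0;
}

print final_result( 1973,  2013,  1 ), "\n";    # years differ
print final_result( 1973,  1973,  1 ), "\n";    # years agree
print final_result( 1973,  undef, 1 ), "\n";    # only one year
print final_result( undef, undef, 1 ), "\n";    # no years at all
```

    With the five-word rule satisfied in every call, only the first one (differing years) should print 0.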

      Thanks for the response. I should mention that I'm a noob to Perl; my original code came from copying other scripts and from help from others. For the bits in the code that say = ...; is that what I would actually write in the script, or do I need to put something else in that space? Sorry, I'm a bit lost! I usually use Bash and only started using Perl recently.

        perlsyn and perlfunc will get you the definition of stuff you don't understand. perldebtut will get you started with the Perl debugger, which can be a really useful tool to learn how stuff works; you can try out commands interactively. Anything else, ask! We were all noobs once.

        Sorry - yes, the ... parts are logic that I left out.

        For finding the year within a title, you already have similar logic and it's not too hard to write code that finds things looking like a year within another string.

        For determining the five-matching-words logic, you already have that logic in your script, you just need to assign the result of that logic to a variable before doing further checks like available or matching years.
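
        As a rough sketch of what "assign the result to a variable" could look like, the word counting can be pulled into a small subroutine. The name count_common_words is made up for this example, and the counting is a simplified version of the loop in the question (it ignores the $numIncsv1/$numIncsv2 shortcut):

```perl
use strict;
use warnings;

# Words the original script excludes from matching.
my $skip = qr/^(?:rare|vol|volume|issue|double|magazine|mag)$/i;

# Hypothetical helper: number of distinct words of $left that also occur in $right.
sub count_common_words {
    my ( $left, $right ) = @_;
    my %seen;
    my @words = grep { !/$skip/ && !$seen{ lc $_ }++ } split ' ', $left;

    # \Q...\E quotes regex metacharacters that may appear in a title word.
    return scalar grep { $right =~ /\b\Q$_\E\b/i } @words;
}

my $desired = 5;
my $left  = 'RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE';
my $right = 'TV & SATELLITE WEEK 11 MAY 2013 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES';

# The word rule on its own, kept in a variable for the later year check.
my $five_words_result = count_common_words( $left, $right ) >= $desired;
print $five_words_result ? "word rule: match\n" : "word rule: no match\n";
```

        For these two titles the common words are TV, RADIO, TIMES, DOCTOR and WHO, so the word rule alone reports a match; the year check then decides whether that result stands.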

      Edit: sorry, I just realized you answered this in your previous reply.

        Yes, I chose $year_left as the name for the variable holding the year number on the left side of the comparison, and $year_right for the right side of the comparison.

        You can look at a string and guess if it contains a four digit year by using the following code for example:

            my $str = 'this is some text 1989 blah';
            my $year;
            $year = $1 if( $str =~ /\b((?:19|20)\d\d)\b/ );

        The regular expression looks at whether the string contains at least one number with four digits that starts with 19 or 20, and sets $year to the first such number. You could put that code into a subroutine as follows to allow for easy reuse:

            sub find_year {
                my( $str ) = @_;
                my $year;
                $year = $1 if( $str =~ /\b((?:19|20)\d\d)\b/ );
                return $year
            }
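
        One possible way to hook this into the existing script without wholesale changes is to skip a candidate line as soon as the years conflict, before any words are counted. The sketch below uses an in-memory list instead of the CSV files, and years_conflict is a hypothetical helper name, not something from the original code:

```perl
use strict;
use warnings;

# Same regular expression as above; returns undef when no year is found.
sub find_year {
    my ($str) = @_;
    return $str =~ /\b((?:19|20)\d\d)\b/ ? $1 : undef;
}

# Hypothetical helper: true only when both titles contain a year
# and the two years differ.
sub years_conflict {
    my ( $left, $right ) = @_;
    my ( $y1, $y2 ) = ( find_year($left), find_year($right) );
    return defined $y1 && defined $y2 && $y1 != $y2;
}

my $title = 'RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE';
my @candidates = (
    'TV & SATELLITE WEEK 11 MAY 2013 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES',
    'TV & SATELLITE WEEK 11 MAY 1973 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES',
    'TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES',
);

for my $value (@candidates) {
    # In the original script this check would sit at the top of the
    # "foreach my $csv2 (keys %csv2hash)" loop, using $csv2hash{$csv2}.
    if ( years_conflict( $title, $value ) ) {
        print "skip:  $value\n";
        next;
    }
    print "check: $value\n";    # the existing word counting would run here
}
```

        Only the first candidate (2013 against 1973) should be skipped; the other two fall through to the five-word rule. Note that find_year only sees the first year-like number in a title, so a title containing two different years is an edge case this sketch does not handle.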
Re: Additional "year" matching functionality in word matching script
by Cow1337killr (Monk) on Jun 28, 2016 at 09:25 UTC

    I would put together a small test program to test just this new logic along with a small test file, maybe a dozen lines.

    If all goes well, then you just have to figure out how to drop the logic from your test program into the real program and test it with your small test file and then test it with your production test file.
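
    A small test program along those lines might look like the sketch below, using the core Test::More module and shortened versions of the four example pairs from the question. The titles_match subroutine is a hypothetical stand-in for the logic under test, and it leaves out the exception-word list for brevity:

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Hypothetical stand-in for the logic under test; in practice this would be
# the matching code lifted out of the real script.
sub titles_match {
    my ( $left, $right ) = @_;

    # A failed match in list context leaves the variable undef.
    my ($y1) = $left  =~ /\b((?:19|20)\d\d)\b/;
    my ($y2) = $right =~ /\b((?:19|20)\d\d)\b/;
    return 0 if defined $y1 && defined $y2 && $y1 != $y2;

    # Count distinct words of the left title that also occur in the right one.
    my %seen;
    my @words  = grep { !$seen{$_}++ } split ' ', uc $left;
    my $common = grep { $right =~ /\b\Q$_\E\b/i } @words;
    return $common >= 5 ? 1 : 0;
}

my $left = 'TV RADIO TIMES DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE';
my $tail = 'GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES';
( my $left_no_year = $left ) =~ s/ 1973//;

ok( !titles_match( $left, "TV & SATELLITE WEEK 11 MAY 2013 $tail" ), 'different years: no match' );
ok( titles_match( $left, "TV & SATELLITE WEEK 11 MAY 1973 $tail" ),  'same year: match' );
ok( titles_match( $left, "TV & SATELLITE WEEK 11 MAY $tail" ),       'one year only: match' );
ok( titles_match( $left_no_year, "TV & SATELLITE WEEK 11 MAY $tail" ), 'no years: match' );
```

    Once the four cases pass, the same test lines can be pointed at the subroutine taken from the real script.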