Hello,
I have the following Perl script, which compares the words in the 2nd field of two CSV files. If 5 or more words match, it prints both lines together as a positive match.
#!/bin/perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
    $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
    my %words;
    $words{$_}++ for split /\s+/, $title; #/ get words

    ## Collect unique words
    my @titlewords = keys(%words);
    my @new; #add exception words which shouldn't be matched
    foreach my $t (@titlewords){
        push(@new, $t) if $t !~ /^(rare|vol|volume|issue|double|magazine|mag)$/i;
    }
    @titlewords = @new;

    my $desired = 5;
    my $matched = 0;

    foreach my $csv2 (keys %csv2hash) {
        my $count = 0;
        my $value = $csv2hash{$csv2};

        foreach my $word (@titlewords) {
            my @matches = ( $value=~/\b$word\b/ig );
            my $numIncsv2 = scalar(@matches);
            @matches = ( $title=~/\b$word\b/ig );
            my $numIncsv1 = scalar(@matches);
            ++$count if $value =~ /\b$word\b/i;
            if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
                $count = $desired+1;
                last;
            }
        }

        if ($count >= $desired) {
            print "$csv2\n";
            ++$matched;
        }
    }
    print "$_\n\n" if $matched;
}
close CSV1;
I would now like to add extra functionality: if both lines contain a four-digit year (for example 1989), then the years in the two CSV files have to match for the pair to be considered a positive match.
However, if only one of the lines contains a year, I would like the usual rule of 5 matching words to apply, and the year becomes irrelevant.
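The year rule described above could be expressed as a small pair of helpers. This is only a sketch: the names `title_year` and `years_compatible` are hypothetical, and it assumes a "year" means the first standalone four-digit number from 1900 to 2099 in the title.

```perl
use strict;
use warnings;

# Return the first standalone four-digit year in a title, or undef.
# Assumption: only 1900-2099 count as years.
sub title_year {
    my ($title) = @_;
    my ($year) = $title =~ /\b((?:19|20)\d\d)\b/;
    return $year;
}

# True when the year rule does not rule the pair out:
# - one or both titles have no year  => the year is irrelevant
# - both titles have a year          => the years must be equal
sub years_compatible {
    my ($title1, $title2) = @_;
    my ($y1, $y2) = (title_year($title1), title_year($title2));
    return 1 unless defined $y1 && defined $y2;
    return $y1 == $y2;
}
```

A pair of titles would then count as a positive match only when `years_compatible` is true and the existing 5-word rule is also satisfied.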
Examples just to clarify.
In this example field 2 contains 5 matching words but the two years (1973 + 2013) are different so this would be discounted as a match:
2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
12278788, TV & SATELLITE WEEK 11 MAY 2013 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12
In this example the years are the same AND there are 5 or more matching words so this would be a positive match:
2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
12278788, TV & SATELLITE WEEK 11 MAY 1973 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12
In this example only one of the titles contains a year (1973), but there are 5 or more matching words, so I would like this to be considered a positive match:
2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
12278788, TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12
In this example neither title contains a year, but 5 or more words match, so this would be a positive match:
2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
12278788, TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12
How can I add this functionality to the script without making wholesale changes?
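One possibly low-impact approach would be a single guard at the top of the `foreach my $csv2 (keys %csv2hash)` loop, so the word-counting code stays untouched. The sketch below is self-contained rather than a patch: the hash keys (`line-2013` and so on) are hypothetical stand-ins for full CSV lines, the `@candidates` array is only there for illustration, and the year pattern assumes years between 1900 and 2099.

```perl
use strict;
use warnings;

# Stand-in for $title taken from the current csv1 line.
my $title = 'RARE TV RADIO TIMES MAGAZINE DOCTOR WHO 1973 THE THREE 3 DOCTORS DR JON PERTWEE';

# Stand-in for %csv2hash (full csv2 line => title); keys are hypothetical labels.
my %csv2hash = (
    'line-2013' => 'TV & SATELLITE WEEK 11 MAY 2013 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES',
    'line-1973' => 'TV & SATELLITE WEEK 11 MAY 1973 GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES',
    'line-none' => 'TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES',
);

# Year of the csv1 title, computed once per csv1 line (undef if absent).
my ($year1) = $title =~ /\b((?:19|20)\d\d)\b/;

my @candidates;
foreach my $csv2 (sort keys %csv2hash) {
    my $value = $csv2hash{$csv2};
    my ($year2) = $value =~ /\b((?:19|20)\d\d)\b/;

    # The added guard: skip this pair only when both titles
    # contain a year and the two years differ.
    next if defined $year1 && defined $year2 && $year1 != $year2;

    # The existing word-counting code would run here unchanged.
    push @candidates, $csv2;
}

print "$_\n" for @candidates;   # prints line-1973 then line-none
```

In the real script the two year captures and the `next` line would be the only additions; `sort` is used here just to make the demonstration output deterministic.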