Re^12: Addional "year" matching functionality in word matching script

Re^13: Addional "year" matching functionality in word matching script by Cow1337killr (Monk) on Jun 28, 2016 at 20:36 UTC
See http://perlmaven.com/the-default-variable-of-perl, for example. (Many other web pages offer similar tutorials; this one was at the top of the Google search results.) You can also Google "perl $_".

The short answer is that `$_` is a special variable in Perl. Let us say it is an especially special variable: many built-in functions and operators use it by default when you don't name a variable. For example, it comes into play when reading files. Some beginners jump through hoops just to avoid using it.

In the case of your program, the record that is read from the file ends up in `$_`. I had to mimic that behavior in my test program. I put the data (i.e., the one record) into a variable called `$record` simply because I like the descriptive name. I could have named it `$milkshake`, but I didn't. Then I said, "Oh! The program expects this data to be in the special variable `$_`," so I copied `$record` into `$_`. Otherwise, the rest of the code is from your program. I just took a section of your code, made another program out of it, and tested it to make sure that it does what I think it does.
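As a rough illustration of the point above, the following sketch shows `$_` being filled in implicitly by a `while (<...>)` loop and then by hand, the way the test program did it. The sub names `read_records` and `title_from_record` and the in-memory sample data are made up for this example; only the title regex is taken from the program under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: read records the way the original loop does.
sub read_records {
    my ($text) = @_;
    # Open the string as an in-memory file so no real file is needed.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records;
    local $_;                 # protect the caller's $_
    while (<$fh>) {           # each line read lands in $_
        chomp;                # chomp with no argument works on $_
        push @records, $_;
    }
    close $fh;
    return @records;
}

# Hypothetical helper: mimic the loop by copying a record into $_.
sub title_from_record {
    my ($record) = @_;
    local $_ = $record;       # the code below expects its data in $_
    my ($title) = /^.+?,\s*([^,]+?),/;   # a bare regex matches against $_
    return $title;
}

my @records = read_records("1, FIRST TITLE, x\n2, SECOND TITLE, y\n");
for my $record (@records) {
    print title_from_record($record), "\n";
}
```

Copying the record into a localized `$_` lets the section of code lifted from the original program run unchanged, which is the only reason to do it; in new code a named variable such as `$record` is usually easier to read.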
Re^14: Addional "year" matching functionality in word matching script by bms9nmh (Novice) on Jun 30, 2016 at 15:15 UTC
OK, I've deciphered what the initial part of the script does, and I've added some stuff to it which I will post separately once I understand this last bit of the script. I just need some help with the last bit before I try to put everything I've learned together. I've put some comments in the code below about the bits I'm confused about. This is the last bit of the script, which does the match.

```perl
@titlewords = @new;   # switch the @new array back to the name @titlewords now that the exceptions are in place

my $desired = 5;      # Desired matching number of words
my $matched = 0;      # Why is this set to 0? How does it change during the comparison?

foreach my $csv2 (keys %csv2hash) {
    my $count = 0;    # Again, why is this set to 0 at this point? I can see that it's used later
                      # and compared to $desired, but how does it increase in size past 0
                      # during the operation?
    my $value = $csv2hash{$csv2};   # How does this represent the value? There doesn't seem to be
                                    # any code which counts the words here?
    foreach my $word (@titlewords) {
        my @matches   = ( $value=~/\b$word\b/ig );
        my $numIncsv2 = scalar(@matches);
        @matches      = ( $title=~/\b$word\b/ig );
        my $numIncsv1 = scalar(@matches);
        ++$count if $value =~ /\b$word\b/i;
        if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
            $count = $desired+1;
            last;
        }
    }
    if ($count >= $desired) {
        print "$csv2\n";
        ++$matched;
    }
}
print "$_\n\n" if $matched;
}
close CSV1;
```
Re^15: Addional "year" matching functionality in word matching script by Cow1337killr (Monk) on Jul 01, 2016 at 02:30 UTC
Here is the program from your original post, unchanged except for numerous print statements.

```perl
#!/usr/bin/perl

#  match5.pl      perl match5.pl  Test the entire program.
# From http://www.perlmonks.org/?node_id=1166649

use strict;
use warnings;

print "File: ", __FILE__, " Line: ", __LINE__, "\n";    # This code is for testing.
print 'The program has started.', "\n";                 # This code is for testing.

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
  my %words;
  $words{$_}++ for split /\s+/, $title;    #/ get words
  ## Collect unique words
  my @titlewords = keys(%words);
  my @new;                          #add exception words which shouldn't be matched
  foreach my $t (@titlewords){
    push(@new, $t) if $t !~ /^(rare|vol|volume|issue|double|magazine|mag)$/i;
  }
  print "File: ", __FILE__, " Line: ", __LINE__, "\n";    # This code is for testing.
  print '@new: ', join(", ", @new), "\n";                 # This code is for testing.
  @titlewords = @new;
  my $desired = 5;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
    print 'xxxxxxxxxxxxxxxxxxxxxxxx At the top of the foreach my $csv2 (keys %csv2hash) { outer loop xxxxxxxxxxxxxxxxxxxxxxxx', "\n";  # This code is for testing.
    my $count = 0;
    my $value = $csv2hash{$csv2};
    print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
    print '$value: ', $value, "\n";                       # This code is for testing.
    foreach my $word (@titlewords) {
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print 'xxxxxxxxxxxxxxxxxxxxxxxx At the top of the foreach my $word (@titlewords) { inner loop xxxxxxxxxxxxxxxxxxxxxxxx', "\n";  # This code is for testing.
      my @matches   = ( $value=~/\b$word\b/ig );
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print '@matches: ', join(", ", @matches), "\n";       # This code is for testing.
      my $numIncsv2 = scalar(@matches);
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print '$numIncsv2: ', $numIncsv2, "\n";               # This code is for testing.
      @matches      = ( $title=~/\b$word\b/ig );
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print '@matches: ', join(", ", @matches), "\n";       # This code is for testing.
      my $numIncsv1 = scalar(@matches);
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print '$numIncsv1: ', $numIncsv1, "\n";               # This code is for testing.
      ++$count if $value =~ /\b$word\b/i;
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print '$count: ', $count, "\n";                       # This code is for testing.
      if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
        $count = $desired+1;
        print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
        print '$count: ', $count, "\n";                       # This code is for testing.
        last;
      }
    }
    if ($count >= $desired) {
      print "File: ", __FILE__, " Line: ", __LINE__, "\n";  # This code is for testing.
      print "$csv2\n";
      ++$matched;
    }
  }
  print "File: ", __FILE__, " Line: ", __LINE__, "\n";    # This code is for testing.
  print "$_\n\n" if $matched;
}
close CSV1;

print "File: ", __FILE__, " Line: ", __LINE__, "\n";    # This code is for testing.
print 'The program has ended.', "\n";                   # This code is for testing.

__END__
```

Here is the input file named csv2.

```
12278788, TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES , http://www.example.co.uk, 12
```

Here is the input file named csv1.

```
2523021356, RARE TV RADIO TIMES MAGAZINE DOCTOR WHO THE THREE 3 DOCTORS DR JON PERTWEE, http://www.example.co.uk, 12
```

Here is the output (about 6 kB, collapsed in the original post).

Feel free to ask further questions.
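The counters asked about earlier in the thread can be followed in a stripped-down sketch. It is only an approximation of the posted program: the helper name `count_shared_words` is made up, the words are sorted so the run is repeatable, each word is wrapped in `\Q...\E` (which the original does not do), and the early `last` and the `$numIncsv1`/`$numIncsv2` test are left out. Both counters start at 0 because nothing has matched yet; `$count` goes up by one for each csv1 word found in a csv2 title, and `$matched` goes up by one for each csv2 record whose `$count` reaches `$desired`.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many distinct words of $title1
# (minus the exception words) also occur as whole words in $title2.
sub count_shared_words {
    my ($title1, $title2) = @_;

    my %words;
    $words{$_}++ for split /\s+/, $title1;          # unique words of title 1
    my @titlewords = grep { !/^(rare|vol|volume|issue|double|magazine|mag)$/i }
                     sort keys %words;

    my $count = 0;                                  # nothing has matched yet
    foreach my $word (@titlewords) {
        # \Q...\E is an addition here: it stops punctuation in a word
        # from being read as regex syntax.
        ++$count if $title2 =~ /\b\Q$word\E\b/i;    # one more shared word
    }
    return $count;
}

my $desired = 5;
my $title1  = 'RARE TV RADIO TIMES MAGAZINE DOCTOR WHO THE THREE 3 DOCTORS DR JON PERTWEE';
my @csv2_titles = (
    'TV & SATELLITE WEEK 11 MAY GILLIAN ANDERSON DOCTOR WHO NOT RADIO TIMES ',
);

my $matched = 0;                                    # no csv2 record has qualified yet
foreach my $title2 (@csv2_titles) {
    my $count = count_shared_words($title1, $title2);
    print "shared words: $count\n";
    ++$matched if $count >= $desired;               # this record qualifies
}
print "matched records: $matched\n";
```

With the two sample titles, the shared words are TV, RADIO, TIMES, DOCTOR and WHO, so `$count` ends at 5, which meets `$desired`, and `$matched` becomes 1. In the posted program `$value` is just the csv2 title stored in `%csv2hash`; the counting happens in the inner loop, not at the point where `$value` is assigned.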
Re^16: Addional "year" matching functionality in word matching script by bms9nmh (Novice) on Jul 01, 2016 at 14:20 UTC
Re^17: Addional "year" matching functionality in word matching script by bms9nmh (Novice) on Jul 01, 2016 at 15:02 UTC
