jwesley has asked for the wisdom of the Perl Monks concerning the following question:

Hello, here is my problem. I have written a script that looks for duplicate strings in 3 different files and outputs which strings are duplicated and in which files. This worked great until I was tasked with modifying the script to search an unknown number of files and report on those as well. Below is the current script:

use File::Copy;
use File::Find;

### Parse numerical characters and trailing white-space from rmtdb.lrl for matching.
copy("rmtdb.lrl" , "rmtdb.tmp") or die "rmtdb.lrl file cannot be copied: $!\n";
system "cat rmtdb.tmp | cut -d ' ' -f1 > rmtdb.tmp1";

### Open File Handles.
open comdb, "< dblist.comdbg" or die "Cannot open connection to dblist.comdbg: $!\n";
open varldb, "< dblist.varldb" or die "Cannot open connection to dblist.varldb: $!\n";
open rmtdb, "< rmtdb.tmp1" or die "Cannot open connection to rmtdb.lrl: $!\n";

### Create Lists
@comdb = <comdb>;
@varldb = <varldb>;
@rmtdb = <rmtdb>;

### Close File Handles.
close comdb;
close varldb;
close rmtdb;

### Case-shift rmtdb to lowercase.
foreach (@rmtdb) { s/$_/\L$_/gi; }

### Begin matching.
foreach $db (@comdb) {    # comdb against varldb.
    @result = grep /^\Q$db\E$/i, @varldb;
    push(@com2var, @result);
}
foreach $db (@comdb) {    # comdb against rmtdb.
    @result = grep /^\Q$db\E$/i, @rmtdb;
    push(@com2rmt, @result);
}
foreach $db (@varldb) {   # varldb against rmtdb.
    @result = grep /^\Q$db\E$/i, @rmtdb;
    push(@var2rmt, @result);
}

### Sort matches for final output.
foreach (@com2var) {
    chomp($_);
    $hash1{$_} = "dblist.comdbg dblist.varldb";
}
foreach (@com2rmt) {
    chomp($_);
    if (exists $hash1{$_}) {
        $hash1{$_} = "dblist.comdbg dblist.varldb rmtdb.lrl";
    }
    else {
        $hash1{$_} = "dblist.comdbg rmtdb.lrl";
    }
}
foreach (@var2rmt) {
    chomp($_);
    if (! exists $hash1{$_}) {
        $hash1{$_} = "dblist.varldb rmtdb.lrl";
    }
}

### Final Output.
print "\n";
foreach (keys %hash1) {
    print "$_ is duplicated in: $hash1{$_}\n";
}
print "\n";

### Cleanup
unlink "rmtdb.tmp", "rmtdb.tmp1";
exit 0;

Now there can be up to 20 rmt(*)db.lrl files in a given directory. I've figured out how to find the files, but I'm having trouble with the matching afterwards.

Replies are listed 'Best First'.
Re: String matching
by wind (Priest) on Apr 21, 2011 at 08:38 UTC

    Your script can be made much simpler if you generalize it.

    use strict;
    use warnings;
    use File::Copy;
    use File::Find;

    ### Parse numerical characters and trailing white-space from rmtdb.lrl for matching.
    copy("rmtdb.lrl" , "rmtdb.tmp") or die "rmtdb.lrl file cannot be copied: $!\n";
    system "cat rmtdb.tmp | cut -d ' ' -f1 > rmtdb.tmp1";

    my %lines;
    for my $file (qw(dblist.comdbg dblist.varldb rmtdb.tmp1)) {
        open my $fh, $file or die "Can't open $file: $!";
        while (<$fh>) {
            chomp;
            # Special case of rmtdb
            $_ = lc $_ if $file eq 'rmtdb.tmp1';
            $lines{$_}{$file}++;
        }
        close $fh;
    }

    foreach my $line (keys %lines) {
        my @files = sort keys %{$lines{$line}};
        next if @files < 2;
        print "$line is duplicated in: " . join(' ', @files) . "\n";
    }

    ### Cleanup
    unlink "rmtdb.tmp", "rmtdb.tmp1";

    Then if you want to process 20 files, you can just add them to the loop
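    For instance, the file list can be built at runtime rather than hard-coded. The sketch below is one way to do that, assuming the rmt*db.lrl files sit in the current directory (glob is used here instead of File::Find for brevity); the `report_duplicates` name is hypothetical.

```perl
use strict;
use warnings;

# Sketch: the same duplicate-reporting loop, fed a variable number of files.
sub report_duplicates {
    my @files = @_;
    my %lines;
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (<$fh>) {
            chomp;
            $lines{ lc $_ }{$file}++;   # lowercase everything for matching
        }
        close $fh;
    }
    my @report;
    for my $line (sort keys %lines) {
        my @in = sort keys %{ $lines{$line} };
        push @report, "$line is duplicated in: @in" if @in > 1;
    }
    return @report;
}

# Discover the rmt*db.lrl files with glob and add them to the fixed names;
# grep { -e } skips any that don't exist on this server.
print "$_\n" for report_duplicates(
    grep { -e } 'dblist.comdbg', 'dblist.varldb', glob('rmt*db.lrl')
);
```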

      Thank you for the response. I haven't tried this yet, but I will shortly. The problem I'm having is that I first have to search for each rmt*db.lrl file, because each server is different. This is what I'm working with for that purpose.

      ### Collect and parse rmt(*)db.lrl files.
      my @dirs = qw(/bb/bin/jwarfel);
      my (%rmtlines, @rmtdbs);
      find(\&parse, @dirs);

      sub parse {
          # Match rmtdb.lrl and rmt1db.lrl .. rmt99db.lrl. The original
          # alternation /^rmt[1-9]|[1-9][1-9]|db.lrl$/ binds ^ and $ to the
          # first and last branches only, so it matched far more than intended.
          return unless /^rmt\d{0,2}db\.lrl$/;
          copy($_, "$_.tmp");
          system "cut -d ' ' -f1 $_.tmp > $_.tmp1";
          open my $fh, '<', "$_.tmp1" or die "Can't open $_.tmp1: $!";
          $rmtlines{$_} = [ map { lc } <$fh> ];   # lowercase for matching
          close $fh;
          unlink "$_.tmp", "$_.tmp1";
          push(@rmtdbs, $_);
      }
Re: String matching
by John M. Dlugosz (Monsignor) on Apr 21, 2011 at 08:29 UTC
    Try using the readmore tag to keep your post more concise.

    Can you be more specific as to what you're having trouble with? You know how to find duplicates between files. You found the files of interest. I don't understand "matching afterwards".

    Repeating the code three times for each pair of files is bad to begin with, and clearly will not scale to larger groups of files! Use a loop. You don't need separate hashes for everything, and I don't see the point of temp files.

    Here's how I would do it: For each file, read each line and hash it. Store the hash (not the whole line) as the key to a master hash, with the value being a list of file names it was seen in. So, for each line, push the current file name onto the value of that line's key.

    After going through all the files, iterate through each hash entry and note which ones have more than one item in the value.
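    As a minimal sketch of the approach described above (the `find_duplicates` name is an assumption, and Digest::MD5 stands in for "hash the line"):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Sketch: key a master hash on each line's digest (not the line itself),
# with the value recording which files the line was seen in.
sub find_duplicates {
    my @files = @_;
    my %by_digest;    # digest => { files => { name => count }, text => line }
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my $d = md5_hex(lc $line);        # case-insensitive match
            $by_digest{$d}{files}{$file}++;
            $by_digest{$d}{text} = $line;     # keep one readable copy
        }
        close $fh;
    }
    # Keep only lines seen in more than one file.
    my %dups;
    for my $d (keys %by_digest) {
        my @in = sort keys %{ $by_digest{$d}{files} };
        $dups{ $by_digest{$d}{text} } = \@in if @in > 1;
    }
    return %dups;
}

if (@ARGV) {
    my %dups = find_duplicates(@ARGV);
    print "$_ is duplicated in: @{ $dups{$_} }\n" for sort keys %dups;
}
```

    Storing the digest rather than the line keeps memory roughly constant per distinct line regardless of line length.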

Re: String matching
by toolic (Bishop) on Apr 22, 2011 at 01:47 UTC
    Unrelated to your problem...
    system "cat rmtdb.tmp | cut -d ' ' -f1 > rmtdb.tmp1";
    is more simply coded as:
    system "cut -d ' ' -f1 rmtdb.tmp > rmtdb.tmp1";
    Cat_(Unix)#Useless_use_of_cat
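    For that matter, the shell and the temp file can be skipped entirely by extracting the first field in Perl. A sketch, assuming the file names from the thread (note `split ' '` splits on runs of whitespace, slightly more forgiving than `cut -d ' '`, which splits on single spaces):

```perl
use strict;
use warnings;

# Sketch: write the first whitespace-delimited field of each line of $src
# to $dst, replacing the external cat/cut pipeline.
sub first_fields {
    my ($src, $dst) = @_;
    open my $in,  '<', $src or die "Can't open $src: $!";
    open my $out, '>', $dst or die "Can't write $dst: $!";
    while (my $line = <$in>) {
        my ($first) = split ' ', $line;   # first field, whitespace-separated
        print $out "$first\n" if defined $first;
    }
    close $in;
    close $out;
}

first_fields('rmtdb.tmp', 'rmtdb.tmp1') if -e 'rmtdb.tmp';
```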

      Thanks for everyone's advice and help on this. Here is my final code, which has worked great in DEV so far.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use File::Find;

      my @dbs;
      my @dirs = qw(/Users/jwarfel/Documents/perl);
      find(\&parse, @dirs);

      sub parse {
          # Match rmtdb.lrl and rmt0db.lrl .. rmt99db.lrl; the bare alternation
          # /^rmt[0-9]|[0-9][0-9]|db.lrl$/ would also match unrelated names.
          if (/^rmt\d{0,2}db\.lrl$/) {
              push(@dbs, $_);
          }
      }
      push(@dbs, qw(dblist.comdbg dblist.varldb));

      my %lines;
      for my $file (@dbs) {
          open my $fh, $file or die "Can't open $file: $!";
          while (<$fh>) {
              chomp;
              $_ = lc $_;
              # Keep only the first whitespace-delimited field.
              if (/^(\w+)\s+/) {
                  $_ = $1;
              }
              $lines{$_}{$file}++;
          }
          close $fh;
      }

      foreach my $line (keys %lines) {
          my @files = sort keys %{$lines{$line}};
          next if @files < 2;
          print "$line is duplicated in: " . join(' ', @files) . "\n";
      }
      exit 0;