Finding duplicates in a text file

ilottl has asked for the wisdom of the Perl Monks concerning the following question:

I have the following code, which finds duplicates in 2 text files and spits out the results into 3 separate files. It's great, except that I now want it to ignore special characters such as brackets, commas, apostrophes, quotes etc. How do I amend the code to ignore those types of characters but still report duplicates?

#!/usr/bin/perl -w
#use strict;

sub slurp{ local *ARGV; @ARGV = @_; <> }

my %where;

$file1 = "c:\\gpTemp\\chris\\validation\\brookdeliv\.txt";
$file2 = "c:\\gpTemp\\chris\\validation\\LD\.txt";

$both = "c:\\gpTemp\\chris\\validation\\ld\\Acts_delivered\.txt";
$infile1 = "c:\\gpTemp\\chris\\validation\\ld\\Acts_Missing\.txt";
$infile2 = "c:\\gpTemp\\chris\\validation\\ld\\Acts_Extra\.txt";

open BOTH, "> $both" or die "Cannot open $new for writing: $!";
open INFILE1, "> $infile1" or die "Cannot open $new for writing: $!";
open INFILE2, "> $infile2" or die "Cannot open $new for writing: $!";


$where{$_} .= "1" for slurp($file1);
$where{$_} .= "2" for slurp($file2);

for (sort keys %where) {
  my $where = $where{$_};

  if ($where =~ /12/) {
  print BOTH;

  } elsif ($where =~ /1/) {

  print INFILE1;

  } else {

  print INFILE2;

  }

  print ":\t$_";

}


Replies are listed 'Best First'.
Re: Finding duplicates in a text file by Util (Priest) on Jun 06, 2007 at 05:39 UTC
First, a piece of unsolicited advice: learn how to make your programs work under strictures, as if your career depends on that learning. If you had not commented out "use strict", you would have seen that your "open... or die" statements are using the non-existent var `$new`.

The core of what you are trying to achieve involves making a "canonical" version of your data. See the Wikipedia article on Canonicalization for details.

Working, tested code:

use strict;
use warnings;

# See http://en.wikipedia.org/wiki/Canonicalization
sub canonicalize {
    my ($string) = @_;

    # Remove everything except certain characters.
    $string =~ tr{A-Za-z0-9 }{}cd;

    # Make case-insensitive (if you want)
    # $string = lc $string;

    return $string;
}

sub match_up_canonically {
    my ( $lines1_aref, $lines2_aref ) = @_;

    my %where;
    $where{ canonicalize($_) } |= 1 for @{ $lines1_aref };
    $where{ canonicalize($_) } |= 2 for @{ $lines2_aref };

    my ( @matches, @nonmatches1, @nonmatches2 );

    for ( @{ $lines1_aref} ) {
        my $n = $where{ canonicalize($_) };
        if ( $n == 3 ) {
            push @matches, $_;
        }
        elsif ( $n == 1 ) {
            push @nonmatches1, $_;
        }
        else {
            die "Can't happen";
        }
    }

    for ( @{ $lines2_aref} ) {
        my $n = $where{ canonicalize($_) };
        if ( $n == 3 ) {
            # Do nothing!
            # The matched lines already printed in the @lines1 loop.
            # ...or...
            # Print the matched lines again, because they may be
            # different, just not different in a way that matters.
            # push @matches, $_;
        }
        elsif ( $n == 2 ) {
            push @nonmatches2, $_;
        }
        else {
            die "Can't happen";
        }
    }

    return( \@matches, \@nonmatches1, \@nonmatches2 );
}

my @lines1 = (
    'able baker charlie',
    'roger, fox, dog',
    'Gomez Morticia Cousin-Itt',
    'Wednesday Pugsley Lurch',
);
my @lines2 = (
    'Gomez Morticia Cousin_ITT',
    'roger; fox; dog',
    'Wednesday Pugsley Fester',
    'able baker charlie',
);

my ( $m, $n1, $n2 ) = match_up_canonically( \@lines1, \@lines2 );

print join "\n", 'Matched:', @{ $m }, "\n";
print join "\n", 'Non-matched1:', @{ $n1 }, "\n";
print join "\n", 'Non-matched2:', @{ $n2 }, "\n";
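The reply above works on in-memory arrays, while the original script reads two files and writes three. A minimal sketch of how the same canonical-key idea might be wired to files follows. The helper names (`canonical_key`, `read_lines`, `write_lines`, `compare_files`) and the file names in the usage comment are hypothetical, and the character set kept by `tr` is the one from the reply, which you may need to widen for your data.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only letters, digits and spaces, so that
# punctuation (and stray \r from Windows files) cannot affect comparison.
sub canonical_key {
    my ($line) = @_;
    ( my $key = $line ) =~ tr{A-Za-z0-9 }{}cd;
    return $key;
}

# Read a file into a list of chomped lines (three-arg open, lexical handle).
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path for reading: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

# Write a list of lines to a file, one per line.
sub write_lines {
    my ( $path, @lines ) = @_;
    open my $fh, '>', $path or die "Cannot open $path for writing: $!";
    print {$fh} "$_\n" for @lines;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Split the lines of two files into three lists:
# in both files, only in the first, only in the second.
sub compare_files {
    my ( $file1, $file2 ) = @_;
    my @lines1 = read_lines($file1);
    my @lines2 = read_lines($file2);

    # Bit 1 marks "seen in file 1", bit 2 marks "seen in file 2".
    my %where;
    $where{ canonical_key($_) } |= 1 for @lines1;
    $where{ canonical_key($_) } |= 2 for @lines2;

    my @both  = grep { $where{ canonical_key($_) } == 3 } @lines1;
    my @only1 = grep { $where{ canonical_key($_) } == 1 } @lines1;
    my @only2 = grep { $where{ canonical_key($_) } == 2 } @lines2;
    return ( \@both, \@only1, \@only2 );
}

# Usage (file names are placeholders, adjust to your own paths):
# my ( $both, $only1, $only2 ) = compare_files( 'file1.txt', 'file2.txt' );
# write_lines( 'Acts_delivered.txt', @{$both} );
# write_lines( 'Acts_Missing.txt',   @{$only1} );
# write_lines( 'Acts_Extra.txt',     @{$only2} );
```

Each `die` message interpolates the path actually being opened, which avoids the `$new` problem pointed out above. Matched lines are reported in the form they have in the first file.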
Re: Finding duplicates in a text file by GrandFather (Saint) on Jun 06, 2007 at 04:12 UTC
use strict;
use warnings;

my @lines1 = (
    'The, slow, lazy, brown and green dog',
    qq~I now want it to ignore special characters such as (), ",", ', " etc.~,
    qq~I have code which finds duplicates in 2 text files~,
);
my @lines2 = (
    'The slow lazy brown and green dog',
    qq~I now want it to ignore special characters such as (), ",", ', " etc.~,
    qq~I have code which finds duplicates in 2 files and spits into 3 files~,
);
my %where;

@lines1 = map {(my $clean = $_) =~ s/[^\s\w]//g; [$_, $clean]} @lines1;
@lines2 = map {(my $clean = $_) =~ s/[^\s\w]//g; [$_, $clean]} @lines2;

$where{$_->[1]} = ["1", $_->[0]] for @lines1;
$where{$_->[1]} = [($where{$_->[1]}[0] .= "2"), $_->[0]] for @lines2;

for (sort keys %where) {
    my $where = $where{$_}->[0];
    my $what  = $where{$_}->[1];

    if ($where =~ /12/) {
        print "12: $what\n";
    } elsif ($where =~ /1/) {
        print "1: $what\n";
    } else {
        print "2: $what\n";
    }
}

Prints:

2: I have code which finds duplicates in 2 files and spits into 3 files
1: I have code which finds duplicates in 2 text files
12: I now want it to ignore special characters such as (), ",", ', " etc.
12: The slow lazy brown and green dog

This is close, but only shows one variant of the dog line. It's really a question of how you actually want to handle that situation.

DWIM is Perl's answer to Gödel
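If you do want every variant of a matching line reported, one possible approach is to group the original lines under their cleaned form instead of keeping just one. The sketch below assumes the same `s/[^\s\w]//g` cleaning as the code above. `group_variants` and `report` are hypothetical names, and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Group original lines by their cleaned form, remembering which
# source (1 or 2) each cleaned form was seen in.
sub group_variants {
    my ( $lines1_aref, $lines2_aref ) = @_;
    my %group;
    my $source = 0;
    for my $lines ( $lines1_aref, $lines2_aref ) {
        $source++;
        for my $line ( @{$lines} ) {
            ( my $clean = $line ) =~ s/[^\s\w]//g;
            $group{$clean}{where}{$source} = 1;

            # Keep each distinct spelling once, in first-seen order.
            push @{ $group{$clean}{variants} }, $line
                unless $group{$clean}{seen}{$line}++;
        }
    }
    return \%group;
}

# Build one report line per variant, prefixed with "1", "2" or "12".
sub report {
    my ($group) = @_;
    my @out;
    for my $clean ( sort keys %{$group} ) {
        my $where = join '', sort keys %{ $group->{$clean}{where} };
        push @out, "$where: $_" for @{ $group->{$clean}{variants} };
    }
    return @out;
}

my @lines1 = ( 'The, slow, lazy, brown and green dog', 'only in one' );
my @lines2 = ( 'The slow lazy brown and green dog' );
print "$_\n" for report( group_variants( \@lines1, \@lines2 ) );
```

With the sample data, both spellings of the dog line are printed with the `12:` prefix, and `only in one` is printed with `1:`. Whether to list all variants or only the first is the design decision mentioned above; this sketch simply makes the alternative concrete.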