Cincyman has asked for the wisdom of the Perl Monks concerning the following question:

OK, I am trying to compare two lists of names and find the matches between them. My actual data contains variations of names, which is why I am using MatchNames. Every time I run through my loop I eat an additional 3MB of memory, and I don't know why. Any help is appreciated.
use Lingua::EN::MatchNames;

open (TERMFILE, $ARGV[0]);
my(@termusers) = <TERMFILE>;
chomp @termusers;
open (USERFILE, $ARGV[1]);
my(@curusers) = <USERFILE>;
chomp @curusers;
open (DUPFILE, ">dup.$ARGV[1]");

####Lets Create the Hash################
foreach $curuser (@curusers) {
    chomp $curuser;
    $curusercounter++;
    print "Adding current user $curusercounter $curuser to Array\n";
    $curlookup{$curusercounter} = $curuser;
}
foreach $termuser (@termusers) {
    chomp $termuser;
    $termusercounter++;
    print "Adding Term user $termusercounter $termuser to Array\n";
    $termlookup{$termusercounter} = $termuser;
}

@termuserlist = keys %termlookup;
@curuserlist = keys %curlookup;

foreach $termusername (@termuserlist) {
    &NameComp($termlookup{$termusername})
}

sub NameComp () {
    foreach $curusername (@curuserlist) {
        print "comparing $_[0] to $curlookup{$curusername}\n";
        my $name_score = (name_eq($_[0], $curlookup{$curusername}));
        print "$name_score\n";
        if ($name_score >= 80){
            print "Found Match $curlookup{$curusername}\n";
        }
    }
}

close (TERMFILE);
close (USERFILE);
close (DUPFILE);
Executing with perl -w matcher.pl test.txt test2.txt
test.txt contains the following entries:
Robert Forbes
Thomas Forbes
Jane Doe
John Doe
Bad User
test2.txt contains the following
Tom Forbes
Bob Forbes
Janie Doe
Johnny Doe
Wrong User
I am also getting the following error message:
Use of uninitialized value in numeric ge (>=) at matcher.pl line 44, <USERFILE> line 5.

Updated Steve_p - changed module mentioned in title from MatchNames.pm to Lingua::EN::MatchNames

Replies are listed 'Best First'.
Re: Memory Leak when using Lingua::EN::MatchNames
by TomDLux (Vicar) on May 05, 2004 at 00:11 UTC

    I don't know anything about MatchNames, but I'd like to help you simplify the rest of your code.

    • When you open a file, close it as soon as possible, rather than leaving it hanging about. Oh, it's been a few decades since open files have had so significant an effect on performance, but it's still clumsy and unattractive. Close the file and you can be sure of its status:
      open (TERMFILE, $ARGV[0]);
      my(@termusers) = <TERMFILE>;
      chomp @termusers;
      close TERMFILE;
    • You read in two arrays, copy to hashes which you use as if they were arrays, then generate arrays of integers used to index the hashes, and you iterate through the indices to process the names.

      Why not delete everything between opening DUPFILE and the loop? Then, instead of iterating over the indices in the hash/array, you can simply iterate over the names in the first array:

      foreach $termusername (@termusers) {
          NameComp( $termusername );
      }

      # or simpler ....
      NameComp( $_ ) foreach ( @termusers );
    • You pass the term user name as an argument to NameComp(), but you reach out and access @curuserlist as a global variable. Accessing global variables is always a warning that you should possibly be doing something different.

      The simplest solution is to pass the array as an argument, along with the name to look up. You want to be careful to pass a reference, not the whole array, otherwise you would be copying it each time you access the routine:

      NameComp( $termusername, \@curusers );

      Passing the array each time is somewhat clumsy. That isn't such a big deal here, but if you invoke the routine from a number of places, you might get tired of providing the extra argument which isn't really relevant to what you're doing. However, at this point my suggestions may begin to go against my claim of simplifying your code ...

      Using a module becomes attractive at that point. It would have two routines, one to read in the file and generate its private list of users, and the NameComp routine.

      package NameComp;

      use Lingua::EN::MatchNames;

      my @curUsers;

      sub readFile {
          my ( $filename ) = @_;
          # Parentheses matter: without them "or die" binds to $filename, not to open.
          open( USERFILE, $filename ) or die $!;
          @curUsers = <USERFILE>;
          chomp @curUsers;
          close USERFILE;
      }

      sub compare {
          my ( $termUser ) = @_;
          foreach ( @curUsers ) {
              # something involving $termUser and $_
          }
      }

      package main;

      die ("Usage: $0 <path to term user list> <path to current user list>")
          unless( 2 == @ARGV );

      open( TERMFILE, $ARGV[0] ) or die $!;
      my( @termUsers ) = <TERMFILE>;
      chomp @termUsers;
      close TERMFILE;

      NameComp::readFile( $ARGV[1] );

      for ( @termUsers ) {
          NameComp::compare( $_ );
      }
      The only problem with this arises if your script works so well that your next script uses two different sets of comparisons, let's say one of a list of hockey players and another of hurricanes of the 20th century. The module variable @curUsers would need to hold hockey player names one minute and hurricane names the next. The solution is to create an object; the one disadvantage is that you need to carry around a reference to your object instance:
      package NameComp;

      use Lingua::EN::MatchNames;

      sub new {
          my ( $class, $filename ) = @_;
          my $self = {};
          bless $self, $class;
          $self->readFile( $filename );
          return $self;
      }

      sub readFile {
          my $self = shift;
          my ( $filename ) = @_;
          # Parentheses matter: without them "or die" binds to $filename, not to open.
          open( USERFILE, $filename ) or die $!;
          @{ $self->{users} } = <USERFILE>;
          chomp @{ $self->{users} };
          close USERFILE;
      }

      sub compare {
          my $self = shift;
          my ( $termUser ) = @_;
          foreach ( @{ $self->{users} } ) {
              # something involving $termUser and $_
          }
      }

      package main;

      die ("Usage: $0 <path to term user list> <path to current user list>")
          unless( 2 == @ARGV );

      open( TERMFILE, $ARGV[0] ) or die $!;
      my( @termUsers ) = <TERMFILE>;
      chomp @termUsers;
      close TERMFILE;

      my $comparer = NameComp->new( $ARGV[1] );

      for ( @termUsers ) {
          $comparer->compare( $_ );
      }
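
    The simpler reference-passing variant mentioned earlier in this reply, NameComp( $termusername, \@curusers ), could be sketched roughly as follows. This is only an illustration under stated assumptions: the $score_fn parameter and the toy_score comparator are hypothetical stand-ins so the sketch runs without Lingua::EN::MatchNames installed; in the real script the comparison would be a call to name_eq.

```perl
use strict;
use warnings;

# Compare one name against every name in the list referenced by $users.
# $score_fn is a hypothetical stand-in for the real comparison routine.
sub NameComp {
    my ( $termuser, $users, $score_fn ) = @_;
    my @matches;
    foreach my $curuser ( @{$users} ) {    # dereference; only the reference was passed in
        my $score = $score_fn->( $termuser, $curuser );
        next unless defined $score;        # treat undef as "no possible match"
        push @matches, $curuser if $score >= 80;
    }
    return @matches;
}

# Toy comparator (an assumption, for illustration only): 100 when the
# last whitespace-separated word of both names is equal, undef otherwise.
sub toy_score {
    my ( $left, $right ) = @_;
    my ( $l_last ) = $left  =~ /(\S+)\s*$/;
    my ( $r_last ) = $right =~ /(\S+)\s*$/;
    return unless defined $l_last && defined $r_last;
    return lc($l_last) eq lc($r_last) ? 100 : undef;
}

my @curusers = ( 'Tom Forbes', 'Janie Doe', 'Wrong User' );
my @found    = NameComp( 'Robert Forbes', \@curusers, \&toy_score );
print "Found Match $_\n" for @found;
```

    Because the routine receives everything it needs through its arguments, it no longer depends on any global variable, which is the point of the suggestion above.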

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Memory Leak when using Lingua::EN::MatchNames
by eXile (Priest) on May 04, 2004 at 21:30 UTC
    Hi, I've reproduced your problem, and I think I've got a fix. If I look at the documentation for Lingua::EN::MatchNames I see you should use 4 arguments instead of 2:
    use Lingua::EN::MatchNames;
    $score= name_eq( $firstn_0, $lastn_0, $firstn_1, $lastn_1 );
    When I change your NameComp function to this:
    sub NameComp () {
        foreach $curusername (@curuserlist) {
            my ($firstn_0, $lastn_0)= split '\s+', $_[0];
            my ($firstn_1, $lastn_1)= split '\s+', $curlookup{$curusername};
            print "comparing $firstn_0, $lastn_0 to $firstn_1, $lastn_1\n";
            my $name_score = name_eq($firstn_0, $lastn_0, $firstn_1, $lastn_1);
            print "$name_score\n";
            if ($name_score >= 80){
                print "Found Match $curlookup{$curusername}\n";
            }
        }
    }
    it not only runs a lot faster, but the memory usage also doesn't increase after the first round of NameComp. This doesn't explain why the memory usage keeps increasing in the original case, though; I'd like to know that as well. As you might have seen in a previous post (Memory usage breakup), I'm quite interested in how to manipulate perl memory usage myself.

    Please be very cautious about calling things like this 'memory leaks'. It is quite normal for perl to use a lot of memory: memory that is freed when variables are no longer referred to is not released to the OS; perl keeps it allocated (at least according to theory; in the post I mentioned, people claim perl does release memory to the OS, which I haven't been able to reproduce on FreeBSD).

Re: Memory Leak when using Lingua::EN::MatchNames
by allolex (Curate) on May 04, 2004 at 21:41 UTC

    I've tried to improve your code, and I have tested what I am posting here.

    You have a couple of problems. The first one is that you are creating a number of unnecessary arrays with your extra foreach loops. I've tried to optimize that by using keys with hashes as opposed to creating temporary lookup arrays, and then creating a subroutine to handle reading in the files, etc. When you're done with a file, it is a good idea to close it to avoid leaving dangling filehandles.

    The other thing has already been pointed out. If you look at the Lingua::EN::MatchNames documentation, you can see it expects fn1, ln1, fn2, ln2. I fixed that, kind of (see my comment). The name_eq() function will return undef if there is no possible match, so I added handling for that as well. That accounts for the "uninitialized value" warnings.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Lingua::EN::MatchNames;

    my $termfile = shift;
    my $userfile = shift || die "Usage: $0 TERMFILE USERFILE\n";

    my %curlookup  = getlist($userfile);
    my %termlookup = getlist($termfile);

    open( my $dfh, ">", "dup.$userfile" ) or die "Cannot open dup.$userfile: $!";

    foreach my $termusername (keys %termlookup) {
        NameComp( $termlookup{$termusername} )
    }

    close $dfh;

    # getlist takes a filename as an argument
    sub getlist {
        my $filename = shift;
        my $counter;
        my %results;
        open( my $fh, "<", $filename ) or die "Cannot open $filename: $!";
        while ( <$fh> ) {
            chomp;
            ++$counter;
            next unless m/[A-Za-z]/;
            $results{$counter} = $_;
            print "Adding user $counter $_ from file \'$filename\' to hash\n";
        }
        close $fh;
        return %results;
    }

    sub NameComp {    # no parens, this is not a function prototype
        my $compname = shift;
        foreach my $curusername (keys %curlookup) {
            print "Comparing \'$compname\' to \'$curlookup{$curusername}\'\n";
            # This method is not good because it assumes a ' FN -SPACE- LN ' format
            my @compname = split /\s+/, $compname;
            my @curname  = split /\s+/, $curlookup{$curusername};
            my $name_score = name_eq( $compname[0], $compname[1], $curname[0], $curname[1] );
            if ( $name_score ) {
                if ( $name_score >= 80 ) {
                    print "Found Match $curlookup{$curusername} with a score of $name_score.\n\n";
                }
            }
            else {
                print "\t\tNo possible match.\n\n";
            }
        }
    }
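
    The comment in NameComp above notes that splitting on whitespace assumes a plain "first last" format. One possible way to loosen that a little, offered only as a sketch and not as anything from the module's documentation, is to take the first word as the first name and the last word as the last name, and to refuse to guess when there are fewer than two words. The helper name split_name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (first, last) from a free-form name,
# or an empty list when the name has fewer than two words.
sub split_name {
    my ( $full ) = @_;
    my @parts = split ' ', $full;    # ' ' also skips leading whitespace
    return if @parts < 2;
    return ( $parts[0], $parts[-1] );    # middle names and initials are ignored
}

for my $name ( 'Robert Forbes', '  John  Q.  Doe ', 'Cher' ) {
    my ( $first, $last ) = split_name($name);
    if ( defined $first ) {
        print "first='$first' last='$last'\n";
    }
    else {
        print "cannot split '$name'\n";
    }
}
```

    This still mishandles formats such as "Doe, John" or names with suffixes, so it only narrows the assumption rather than removing it.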

    Best of luck to you.

    --
    Damon Allen Davison
    http://www.allolex.net