carrerag has asked for the wisdom of the Perl Monks concerning the following question:

Hello PerlMonks,
I'm trying to remove lines from a file that have a duplicate ID number, but things don't seem to be working right. A sample data line looks like this:
lat="37.4192" lng="-122.0574" United States ID No: 1123631397
And here is the code:
open (FILE, "<:utf8", "input.txt");
my @lines = <FILE>;
my @uniq = ();
my @purge = ();
my %seen = ();
foreach $line (@lines) {
    my $id = $line =~ m/ID No: (\d+)/;
    if ($seen{$id}++){
        push (@uniq, $line);
        $new_uniq++;
    }
    else {
        push (@purge, $line);
    }
}
open (MYFILE, ">:utf8", "data.txt");
print MYFILE @uniq;
open (PURGE, ">>:utf8", "purge.txt");
print PURGE @purge;
After the first run, the data in data.txt appears to be correct. However, I was expecting purge.txt to contain all the lines that were removed as duplicates, but it only contains one line. Subsequent runs _always_ remove the first line of data from input.txt and place it in purge.txt. Could someone please point out my error? I'm trying to teach myself Perl, but for some reason I can't grasp what's going wrong here.
Thanks!

Replies are listed 'Best First'.
Re: Removing duplicate lines based on a match
by FunkyMonk (Bishop) on Jun 19, 2008 at 16:54 UTC
    In order to capture matches from a regular expression, you'll have to give $id list context:
    my ( $id ) = $line =~ m/ID No: (\d+)/;

    That should solve your problem.
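
    A minimal sketch of the difference, using the sample line from the question:

    my $line = 'lat="37.4192" lng="-122.0574" United States ID No: 1123631397';
    my $ok   = $line =~ m/ID No: (\d+)/;    # scalar context: $ok is 1 (matched) or "" (no match)
    my ($id) = $line =~ m/ID No: (\d+)/;    # list context: $id gets "1123631397"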


    Unless I state otherwise, all my code runs with strict and warnings
Re: Removing duplicate lines based on a match
by kyle (Abbot) on Jun 19, 2008 at 17:16 UTC

    I see a couple of problems. Look here:

    foreach $line (@lines) {
        my $id = $line =~ m/ID No: (\d+)/;
        if ($seen{$id}++){
            # ...

    You're going to get 1 for $id all the time because the match is evaluated in scalar context instead of list context; in scalar context, a successful match returns true (1) rather than the captured value.

    The other thing is you want to say "if ( ! $seen{$id}++ ) ...". The first time an $id is seen, $seen{$id}++ will be false. Every time after that, it will be true. Since you want it to be true once and false ever after, add the negation.
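
    Here's a small self-contained sketch of that post-increment idiom (the sample ids are made up for illustration):

    my %seen;
    for my $id (qw(42 42 7)) {
        # post-increment returns the old value (undef the first time),
        # so !$seen{$id}++ is true exactly once per id
        if ( !$seen{$id}++ ) {
            print "$id: first time seen\n";
        }
        else {
            print "$id: duplicate\n";
        }
    }
    # prints: 42: first time seen, 42: duplicate, 7: first time seen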

    Here's the part I copied, rewritten:

    foreach $line (@lines) {
        my ($id) = $line =~ m/ID No: (\d+)/;
        if ( ! $seen{$id}++ ){
            # ...

    I have some other suggestions for you.

    First and most important, check the return value of open! You try to open your files, but if they fail, you'll never know. Also, it's generally a good idea to use lexical filehandles instead of the global ones you're using.

    open my $in_fh, '<:utf8', 'input.txt'
        or die "Can't read 'input.txt': $!";
    my @lines = <$in_fh>;

    Second, you're doing everything in memory. If your input is huge, you could run out of memory. Instead of reading every line and then looping over them, consider using a loop that reads a line at a time. You'd also want to open your output files at the start and write into them during processing instead of collecting their eventual contents in (memory-based) arrays. Like this:

    open my $in_fh, '<:utf8', 'input.txt'
        or die "Can't read 'input.txt': $!";
    open my $purge_fh, '>>:utf8', 'purge.txt'
        or die "Can't append to 'purge.txt': $!";
    open my $uniq_fh, '>:utf8', 'data.txt'
        or die "Can't write 'data.txt': $!";

    my %seen;
    my $new_uniq = 0;
    while ( my $line = <$in_fh> ) {
        my ($id) = ( $line =~ /ID No: (\d+)/ );
        if ( ! $seen{$id}++ ) {
            print $uniq_fh $line;    # note: no comma after the filehandle
            $new_uniq++;
        }
        else {
            print $purge_fh $line;
        }
    }

    close $in_fh    or die "Close failed for input: $!";
    close $purge_fh or die "Close failed for purge.txt: $!";
    close $uniq_fh  or die "Close failed for data.txt: $!";

    Finally, if you don't already, Use strict and warnings!
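
    As a quick illustration, here's a stripped-down sketch of what strict catches in the original code (the undeclared loop variable):

    use strict;
    use warnings;

    my @lines = ("one\n", "two\n");
    foreach my $line (@lines) {    # without the "my", compilation aborts with
                                   # Global symbol "$line" requires explicit package name
        print $line;
    }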

Re: Removing duplicate lines based on a match
by Fletch (Bishop) on Jun 19, 2008 at 17:18 UTC

    Context problem aside, it's usually better* to use a while loop and process your input line-by-line rather than slurping the entire file into an array.

    while ( my $line = <FILE> ) {
        ## ...
    }

    Your sysadmin and the other users on the box will thank you when input.txt grows to 800M and you don't send the OS thrashing because you've got it all resident in memory.

    (*: FSVO "better", of course. Yes it may be fine in this particular case and this set of inputs will neber eber grow that large fer sure. Really. But you'll be better served by developing a habit of defaulting to line-by-line and only slurping as a conscious choice after looking at the problem.)

    The cake is a lie.

Re: Removing duplicate lines based on a match
by wfsp (Abbot) on Jun 19, 2008 at 17:09 UTC
    I think you'll get a bit further if you use exists to test for uniqueness.
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my @lines = <DATA>;
    my (@uniq, @purge, %seen);

    for my $line (@lines) {
        my ($id) = $line =~ m/ID No: (\d+)/;
        if (not exists $seen{$id}){
            push @uniq, $line;
            $seen{$id} = undef;
        }
        else {
            push @purge, $line;
        }
    }

    print qq{uniq\n};
    print @uniq;
    print qq{purge\n};
    print @purge;

    __DATA__
    lat="37.4192" lng="-122.0574" United States ID No: 1123631397
    lat="37.4192" lng="-122.0574" United States ID No: 1123631397
    lat="37.4192" lng="-122.0574" United States ID No: 1123631398
    lat="37.4192" lng="-122.0574" United States ID No: 1123631399
    lat="37.4192" lng="-122.0574" United States ID No: 1123631400
    lat="37.4192" lng="-122.0574" United States ID No: 1123631400
    output
    uniq
    lat="37.4192" lng="-122.0574" United States ID No: 1123631397
    lat="37.4192" lng="-122.0574" United States ID No: 1123631398
    lat="37.4192" lng="-122.0574" United States ID No: 1123631399
    lat="37.4192" lng="-122.0574" United States ID No: 1123631400
    purge
    lat="37.4192" lng="-122.0574" United States ID No: 1123631397
    lat="37.4192" lng="-122.0574" United States ID No: 1123631400
Re: Removing duplicate lines based on a match
by carrerag (Initiate) on Jun 20, 2008 at 20:05 UTC
    The FAQ seems to frown on popping back in just to say thanks, but I just wouldn't feel right without saying so. So, to heck with the FAQ. Thanks to all who provided answers and suggestions; they were all very helpful! I have everything working as expected now.