http://qs1969.pair.com?node_id=289750

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a script that checks pipe-delimited fields: if the last line has the same last four fields as the previous line, it eliminates that last, duplicate line. The script works, but as usual I'd like it to look better, and this site always provides a great way to get critique and helps me write a more efficient script. Please look at the script and advise how I can make it better.
Thanks!

Script:
use strict;

my $db = 'C:\Inetpub\wwwroot\cgi-bin\test4.txt';
open(DATA, "$db") || die "Can not open: $!\n";
my @dat = (<DATA>);
close(DATA);
open(DATA, "$db") || die "NO GO: $!\n";

my @files;
foreach (@dat) {
    my $key = join(" ", (split /\|/, $_)[1,2,3,4]);
    #print "$key\n";
    push(@files, $key);
}
#print "@files\n";

my %key;
foreach (@files) {
    chomp;
    $key{$_}++;
}

foreach (keys %key) {
    if ($key{$_} > 1) {
        #print "$_ \n";
        my $db = 'test4.txt';
        open(DATA, "$db") || die "Can not open: $!\n";
        my @lines = (<DATA>);
        pop @lines;
        open(FOUT, "> test4.txt") or die $!;
        print FOUT @lines;
        close FOUT;
    }
}
close(DATA);
data in test4.txt:
34|4|45|56|45
45|34|45|00|23
45|34|45|00|27
34|4|456|56|03
36|4|456|56|03   # This line will be eliminated after the script is run because its last four fields are the same as the previous line's.

Replies are listed 'Best First'.
Re: Improvement on script needed.
by bm (Hermit) on Sep 08, 2003 at 12:48 UTC
    Your script would be much quicker if you took the open and close statements out of the foreach (keys %key) loop.

    Do those statements only once, and maintain your data in arrays. Disk operations should be minimal, and in your case you only need to read and write once.
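    A minimal sketch of that structure (read once, filter in memory, write once), assuming the same pipe-delimited layout and using fields 2 through 5 as the duplicate key; the filename and the dedup_lines name are illustrative, not from the thread:

```perl
use strict;
use warnings;

# Return the input lines with later duplicates removed, where the
# duplicate key is fields 2-5 of each pipe-delimited record.
sub dedup_lines {
    my @lines = @_;
    my (%seen, @keep);
    for my $line (@lines) {
        chomp(my $copy = $line);
        my $key = join('|', (split /\|/, $copy)[1 .. 4]);
        push @keep, $line unless $seen{$key}++;
    }
    return @keep;
}

# Read once, filter in memory, write once.
my $db = 'test4.txt';
if (-e $db) {    # guard so the sketch runs even without the data file
    open(my $in, '<', $db) or die "Can not open: $!\n";
    my @keep = dedup_lines(<$in>);
    close($in);
    open(my $out, '>', $db) or die "Can not write: $!\n";
    print $out @keep;
    close($out);
}
```

    The hash keeps the first line seen for each key, which matches the sample data: the fifth line repeats the fourth line's last four fields and is dropped.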
    --
    bm

Re: Improvement on script needed.
by asarih (Hermit) on Sep 08, 2003 at 12:55 UTC
    Is there a reason that you have to split the lines in the first place? That is, if the last four fields are significant and always lumped together, you shouldn't waste time splitting and joining those fields. Also, you shouldn't loop over <DATA> more than once unless absolutely necessary. It can be expensive if the file is large.
    my %seen;
    while (<DATA>) {
        /^([^\|]+)\|(.*)/;    # split at the first "|"
        my ($key, $value) = ($2, $1);
        next if $seen{$key};
        print "$key: $value\n";
        $seen{$key}++;
    }
    __DATA__
    34|4|45|56|45
    45|34|45|00|23
    45|34|45|00|27
    34|4|456|56|03
    36|4|456|56|03
      Good use of a "seen" hash, asarih, but I have to say that I would rather use this:
      my ($value,$key) = $_ =~ /^([^\|]+)\|(.*)/;
      than explicitly "spell out" $1 and $2. But if all you want to do is split at the first pipe, just use split:
      my ($key, $value) = split(/\|/, $_, 2);
      next if $seen{$value};
      print "$key: $value\n";
      $seen{$value}++;
      I don't have time for some benchmarking right now, but substr and index are fast:
      my $index = index($_, '|');
      my $key   = substr($_, 0, $index);
      my $value = substr($_, $index + 1);    # + 1 to skip past the "|" itself
      Just some more ways to do it. ;)
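      For what it's worth, the three variants above (regex capture, split with a limit, and index/substr) can be raced with the core Benchmark module; a rough sketch, with an arbitrary iteration count:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = "34|4|456|56|03";

# Compare the three ways of splitting a record at the first pipe.
cmpthese(100_000, {
    regex => sub {
        my ($key, $value) = $line =~ /^([^\|]+)\|(.*)/;
    },
    split => sub {
        my ($key, $value) = split(/\|/, $line, 2);
    },
    index => sub {
        my $i     = index($line, '|');
        my $key   = substr($line, 0, $i);
        my $value = substr($line, $i + 1);
    },
});
```

      All three extract the same key and value; cmpthese prints a rate table comparing them.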

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
        I tried as suggested but can't get your new script to work on my text file. Please advise what I am doing wrong. Thanks.
        my $db = 'C:\Inetpub\wwwroot\cgi-bin\test4.txt';
        open(DATA, "$db") || die "Can not open: $!\n";
        my @dat = (<DATA>);
        close(DATA);
        open(DATA, "$db") || die "NO GO: $!\n";
        my %seen;
        while (<DATA>) {
            my ($value,$key) = $_ =~ /^([^\|]+)\|(.*)/;
            my ($key,$value) = split(/\|/,$_,2);
            next if $seen{$value};
            print "$key: $value\n";
            push(@files,$key);
            my @files = (<DATA>);
            print DATA @files;
            $seen{$value}++;
        }
        close(DATA);
      Still can't get it to write to the file:
      my $db = 'C:\Inetpub\wwwroot\cgi-bin\test4.txt';
      open(DATA, "$db") || die "Can not open: $!\n";
      my @dat = (<DATA>);
      close(DATA);
      open(DATA, "$db") || die "NO GO: $!\n";
      my %seen;
      while (<DATA>) {
          /^([^\|]+)\|(.*)/;    # split at the first "|"
          my ($key, $value) = ($2,$1);
          next if $seen{$key};
          print "$key: $value\n";
          $seen{$key}++;
          push(@files,$key);
          pop @files;
          open(DATA,"> test4.txt") or die $!;
          print DATA @files;
      }
      close(DATA);

        There are all sorts of problems with this code. You seem to have copied snippets from various answers to your question into your code without understanding any of them. I suggest you spend some considerable time studying the resources listed here.

        However, to return to your immediate problems, let's look, for example, at the number of times you use open in your code. Near the top you have the following:

        open(DATA, "$db") || die "Can not open: $!\n";
        my @dat = (<DATA>);
        close(DATA);
        open(DATA, "$db") || die "NO GO: $!\n";

        Why on earth do you want to reopen (for reading) the same file that you've just closed, once you've already read it into an array?

        As perl is forgiving, you can needlessly open (and/or close) the same file as many times as you want without it complaining, but...

        More importantly, the third time you use open is inside a while loop:

        while (<DATA>) {
            # ...
            open(DATA, "> test4.txt") or die $!;
            # ...
        }

        Apart from the fact that - at least for clarity's sake - you shouldn't be using the same filehandle for two completely different files (and that in any case DATA is not a particularly good filehandle to choose...) - try to envision what the above is doing. As the open line is inside a loop, it will, for each iteration of the loop, open 'test4.txt' for overwriting. If you don't understand that, try running this:

        my $i = 0;
        while ($i < 5) {
            open (OUT, '>oops.txt') || die "NO GO: $!\n";
            print OUT $i;
            $i++;
        }

        as compared with this:

        open (OUT, '>oops.txt') || die "NO GO: $!\n";
        my $i = 0;
        while ($i < 5) {
            print OUT $i;
            $i++;
        }

        dave

Re: Improvement on script needed.
by benn (Vicar) on Sep 08, 2003 at 13:01 UTC
    I'm not going to remove all the fun of coding from you, so here's a hint - you can do all this in a single loop: read, increment key hash, write (or not, as the case may be). You can 'slurp' it all in one go (as you do) and write to the same file, or read from one file and write to another, then rename at the end.

    Then again, you could simply use Tie::File and grep, but where's the fun in that? :)
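    For the curious, the Tie::File route might look something like this sketch (Tie::File ships with recent Perls; the filename is illustrative). Edits to the tied array rewrite the file in place, so the grep is essentially the whole program:

```perl
use strict;
use warnings;
use Tie::File;

# Tie the file's lines to an array; changes to @lines are written
# back to disk automatically.
my $db = 'test4.txt';
tie my @lines, 'Tie::File', $db or die "Can not tie: $!\n";

# Keep only the first line seen for each key of fields 2-5;
# later lines repeating an earlier key are grepped out.
my %seen;
@lines = grep { !$seen{ join('|', (split /\|/)[1 .. 4]) }++ } @lines;

untie @lines;
```

    Note that Tie::File hands you each line with the record separator already stripped, so no chomp is needed before the split.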

    Cheers, Ben.

Re: Improvement on script needed.
by Aragorn (Curate) on Sep 08, 2003 at 13:01 UTC
    Something like this?
    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($cur_rec, $prev_rec, $line) = ("", "", "");
    while ($line = <STDIN>) {
        $cur_rec = join("", (split(/\|/, $line))[1..4]);
        if ($cur_rec ne $prev_rec) {
            print $line;
        }
        $prev_rec = $cur_rec;
    }
    With this script, you can process files of arbitrary length, because the whole file isn't stuffed into a hash. It processes from standard input to standard output, so you can give it any file you like instead of hardcoding filenames in the script. It's a lot more efficient than opening, writing, and closing the same file over and over in a loop.

    Arjen