wink has asked for the wisdom of the Perl Monks concerning the following question:

My perl is a bit rusty so I'm posting this to make sure I'm doing things the "right way".

I've got a one-to-many list of values that I read in using the following code (the %dtgs hash is defined globally):

open(my $dtg_file, "<", $infile)
    or die "Unable to open $infile: $!\n";
while (<$dtg_file>) {
    chomp;
    my ($dtg, @files) = split /:/;
    $dtgs{$dtg} = \@files;
}
close $dtg_file;
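For reference, each line of the input file is assumed to look something like this (a hypothetical sample: a date-time group followed by colon-separated file names):

20130923T1200Z:fileA.dat:fileB.dat:fileC.dat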

I do some processing and when a match is found to one of the files, I want to remove it to speed up further processing (there are about 60k files and they are being compared to over 100 million files, looking for matches).

sub remove_from_dtgs {
    my ($dtg, $file) = @_;
    my @files = grep { $_ ne $file } @{$dtgs{$dtg}};
    if (@files == 0) {
        delete $dtgs{$dtg};
    }
    else {
        $dtgs{$dtg} = \@files;
    }
}

I want to make sure that I'm not creating a memory leak by replacing $dtgs{$dtg} with the new array. If memory serves (no pun intended), Perl keeps a reference count and will free the old array as soon as nothing refers to it anymore. But this script is going to run for a long time (see the 100 million files above), and I want to avoid any slow leaks.

Other optimization suggestions are also welcome. Thanks in advance!

Edited with corrections from kennethk

Replies are listed 'Best First'.
Re: Avoiding Memory Leaks
by kennethk (Abbot) on Sep 23, 2013 at 20:37 UTC
    Your recollection is correct: garbage collection should happen following both delete $dtgs{$dtg}; and $dtgs{$dtg} = \@files;, since both remove an array ref from %dtgs.
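
    If you want to see the reference counting in action, here's a minimal sketch (the Demo class is purely illustrative); its DESTROY fires the moment the last reference to an object goes away:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Tiny class whose destructor announces when Perl frees it.
    package Demo;
    sub new     { bless {}, shift }
    sub DESTROY { print "freed\n" }

    package main;
    my %dtgs;
    $dtgs{a} = [ Demo->new ];
    $dtgs{a} = [];      # prints "freed": the old arrayref's count hit zero
    $dtgs{b} = [ Demo->new ];
    delete $dtgs{b};    # prints "freed" again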

    Depending on your array sizes, you may get more efficiency from a splice than from your grep, since the grep has to create a new array and copy nearly all of the old file names. That code might look something like

    sub remove_from_dtgs {
        my ($dtg, $file) = @_;
        # Walk backward so splicing out an element doesn't shift
        # the indices we haven't visited yet.
        for my $i (reverse 0 .. $#{ $dtgs{$dtg} }) {
            splice @{$dtgs{$dtg}}, $i, 1 if $dtgs{$dtg}[$i] eq $file;
        }
        delete $dtgs{$dtg} unless @{$dtgs{$dtg}};
    }

    If you know lists are unique (no repeats) and that this is the only routine that modifies the arrays, you can add some Loop Control and do a little better:

    sub remove_from_dtgs {
        my ($dtg, $file) = @_;
        for my $i (reverse 0 .. $#{ $dtgs{$dtg} }) {
            if ($dtgs{$dtg}[$i] eq $file) {
                splice @{$dtgs{$dtg}}, $i, 1;
                delete $dtgs{$dtg} unless @{$dtgs{$dtg}};
                last;    # names are unique, so stop after the first hit
            }
        }
    }

    Note that in your original post you were missing parentheses in your test, and that logical tests already impose scalar context, so an explicit scalar is unnecessary.

    Of course, this is an optimization, so make sure to actually test (perhaps with Devel::NYTProf) rather than guess at what's slow.
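
    For instance, assuming the script is saved as compare.pl:

    perl -d:NYTProf compare.pl
    nytprofhtml    # writes an HTML report under ./nytprof/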

    Update: Or, of course, given a uniqueness constraint, you could just use a hash:

    open(my $dtg_file, "<", $infile)
        or die "Unable to open $infile: $!\n";
    while (<$dtg_file>) {
        chomp;
        my ($dtg, @files) = split /:/;
        $dtgs{$dtg}{$_}++ for @files;
    }
    close $dtg_file;

    sub remove_from_dtgs {
        my ($dtg, $file) = @_;
        delete $dtgs{$dtg}{$file};
        delete $dtgs{$dtg} if !keys %{ $dtgs{$dtg} };
    }
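
    That turns each removal into a constant-time hash delete instead of an array scan. The match test becomes a simple lookup as well; one caveat worth noting in a sketch, since a bare nested lookup can autovivify:

    # Guard the outer key first: $dtgs{$dtg}{$file} on its own would
    # autovivify $dtgs{$dtg} after the DTG has been deleted.
    if ($dtgs{$dtg} and $dtgs{$dtg}{$file}) {
        remove_from_dtgs($dtg, $file);
    }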

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Thanks for your reply! The majority (something like 80%) of the DTGs have a single file associated with them, less than 1% have 4 or more, and only a few have 10+. So I don't think a splice will offer much of a performance benefit over the grep. I'll definitely look into it though, thanks! It's not a function I've used much.

      I also apologize for the formatting errors. I had to copy this by hand as it's on a non-internet-facing system.