Possibly long lists and memory leaks

carthag has asked for the wisdom of the Perl Monks concerning the following question:

I'm (re)writing a module (and will eventually put it on CPAN, when the bugs are sorted out) that generates URL-lists based on user input, since the previous version was using up horrible amounts of memory (I've had to kill httpd processes with VSIZEs of over 2 gigs).

However, it did not help putting in a 'maximum' list-length, as I had expected. I tried letting it run over the night, and got the same problem again... Does anyone have an idea?

The module is used as follows:

my $uri = $q->param('uri')||"";
my $uriobj = URI::Fusker->new(uri => $uri, maximum => 100);
my @urls = $uriobj->as_list;
#After which @urls is popped into a template, printed, etc
[download]

And here is the module code:

package URI::Fusker;
#copyright 2001, 2002 carthagtuek@softhome.net

#####################################################################

use strict;

#####################################################################

require Exporter;

our @ISA         = qw(Exporter);
our @EXPORT      = qw();
our @EXPORT_OK   = qw(length generateList generateSequences generateRa
+nges);
our %EXPORT_TAGS = (
                     All => [qw(length generateList generateSequences 
+generateRanges)]
                   );
our $VERSION     = 1.0;

#####################################################################

sub new
{
    my $class = shift;
    
    my $self = {
                uri => "",
                maximum => 10000, #Default maximum length for the gene
+rated list
                @_
               };
    
    return bless $self, $class;
}

#####################################################################

sub as_list
{
    my $self = shift;
    
    if ($self->length <= $self->{maximum})
    {
        return generateList($self->{uri});
    }
    else
    {
        return; #Todo: Make a shorter list.
    }
}

sub uri
{
    my $self = shift;
    
    if (@_) { $self->{uri} = shift }
    
    return $self->{uri};
}

sub maximum
{
    my $self = shift;
    
    if (@_) { $self->{maximum} = shift }
    
    return $self->{maximum};
}

#####################################################################

sub length
{
    my ($self, $uri);
    my $length = 0;
    
    if (ref($_[0]))
    {
        $self = shift;
        $uri = $self->{uri};
    }
    else
    {
        $uri = shift;
    }
    
    while ($uri =~ m/((?:\{[^\}]+\})|(?:\[[^\]]+\]))/ig)
    {
        if    ($1 =~ m/\{([^\}]+)\}/i)
        {
            $length++;
            $length++ while ($1 =~ m/,/);
        }
        elsif ($1 =~ m/\[([^\]]+)\]/i)
        {
            my ($start, $stop) = sort split('-',$1);
            
            $length += $stop - $start + 1;
        }
    }
        
    return $length;
}

#####################################################################

sub generateList
{
    my @urls = @_;
    
    while (grep /\{[^\}]+\}/, @urls) 
    {    #as long as there are {}s in the url list, split out the sequ
+ences.
        @urls = generateSequences(@urls);
    }
    
    while (grep /\[[^\]]+\]/, @urls) 
    {    #as long as there are []s in the url list, split out the rang
+es.
        @urls = generateRanges(@urls);
    }
    
    return @urls;
}

sub generateSequences
{
    my @strings = ();
    
    while (@_)
    {
        my $string = shift @_;
        $string =~ /^([^\{]+)\{([^\}]+)\}(.*)$/;    #something{somethi
+ng}something
        my $begin = $1||'';
        my $mid = $2||'';
        my $end = $3||'';
        
        if ($mid)
        {    #there were a set of {}
            my @items = split(',',$mid);
            
            foreach (@items)
            {    #Taking each item and putting it in the list.
                push @strings, "$begin$_$end";
            }
        }
        else
        {    #there weren't, let's just push the string back out again
            push @strings, $string;
        }
    }
    
    return @strings
}

sub generateRanges
{
    my @strings = ();
    
    while (@_)
    {
        my $string = shift @_;
        $string =~ /^([^\[]+)\[([^\]]+)\](.*)$/;    #something[somethi
+ng]something
        my $begin = $1||'';
        my $mid = $2||'';
        my $end = $3||'';
        
        if ($mid)
        {    #there were a set of []
            my ($start, $stop, undef) = sort split('-',$mid);
            my $digits = 0;
            
            if ($start =~ /\d+/)
            {    #There is one or more digit, so we'll remove anything
+ else that might be there
                $digits = length($start);
                $start =~ s/\D+//g;
                $stop =~ s/\D+//g;
            }
            
            #Catching crap input here:
            $stop = substr("000000$stop",-length($start)) if ($stop =~
+ /^0/); #Make sure they are the same length, eh
            
            foreach ($start .. $stop)
            {    #Taking each item and putting it in the list.
                push @strings, "$begin".("0"x($digits-length($_)))."$_
+$end";
            }
        }
        else
        {    #there weren't, let's just push the string back out again
            push @strings, $string;
        }
    }
    
    return @strings
}

#####################################################################

1;

__END__

=head1 NAME

URI::Fusker - URL generation based on ranges and sequences

=head1 SYNOPSIS

 $uri = URI::Fusker->new("http://www.server.com/directory/{a,b,c}/[1-1
+0].html");
 @urls = $uri->as_list;

=head1 DESCRIPTION

This module emulates the part of the behaviour of the commandline tool
+ cURL. In particular,
it makes possible URL generation based on ranges and sequences as foll
+ows:

=over 3

=item *

An URL containing a range like [1-10] will be expanded to a number of 
+URLs, each with a
number from the range.

=item *

An URL containing a sequence like {a,b,c} will be expanded into a numb
+er of URLs, each
with an item from the sequence.

=back

Thus, in the example seen in L</SYNOPSIS>, 30 URLs will be generated.

=head2 Methods

=over 3

=item new( [uri => ""], [maximum => 10000] )

Creates a new URI::Fusker object. The uri parameter is the uri from wh
+ich to generate
the list, and the maximum parameter is the maximum length (the default
+ is 10,000) of
said list.

=item $uri->as_list

Creates the list using the associated uri, unless the list will be lon
+ger than the set
maximum.

=item $uri->uri([$uri])

Sets or returns the uri for the object.

=item $uri->maximum([$max])

Sets or returns the maximum length of the object.

=item $uri->length

Will attempt to calculate the length of the list that is to be generat
+ed.

=item generateList($uri)

Will generate the list regardless of the length. Be careful, this can 
+be very memory
intensive. It is suggested to only use L</as_list>.

=head1 AUTHOR

Mikkel Eriksen, carthagtuek@softhome.net

=head1 SEE ALSO

http://curl.haxx.se/

=head1 COPYRIGHT

Copyright 2002 Mikkel Eriksen.

This program is free software. You may copy or redistribute it under t
+he same terms as Perl itself.

=cut
[download]

Comment on Possibly long lists and memory leaks Select or Download Code

Replies are listed 'Best First'.
Re: Possibly long lists and memory leaks by carthag (Scribe) on Dec 03, 2002 at 16:51 UTC
With thanks to tye and jdporter, the length method has been rewritten as follows: sub length { my ($self, $uri); my $length = 1; if (ref($_[0])) { $self = shift; $uri = $self->{uri}; } else { $uri = shift; } while ($uri =~ m/((?:\{[^\}]+\})\|(?:\[[^\]]+\]))/ig) { if ($1 =~ m/\{([^\}]+)\}/i) { my $temp = $1; $length = ($temp =~ tr/,/,/); } elsif ($1 =~ m/\[([^\]]+)\]/i) { my ($start, $stop) = sort split('-',$1); $stop = $start+10 unless ($stop); if ($start =~ /\d+/) { #There is one or more digit, so we'll remove anything + else that might be there $start =~ s/\D+//g; $stop =~ s/\D+//g; } $length = $stop - $start + 1; } } return $length; } [download] However, this still isn't entirely perfect. These are some examples from my current log that generate inconsistent results: scalar($uriobj->as_list) $uriobj->length $uri 1 0 foo{011-047}bar 0 -398 foo[800-1199]bar 0 -348 foo[850-1199]bar 0 -298 foo[900-1199]bar 0 50 foo[001-50]bar 0 -97 foo[2-100]bar 0 -97 foo[2-100]bar 0 -7 foo[2-10]bar 1 0 foo{001-188}bar 0 -140 foo[01-20]bar 0 -3 foo[1-05]bar 0 -46 foo[53-100]bar 975 -23 foo[001+-+025]bar 9 6 foo{one,two,three}/[01-03]bar 27 6 foo{one[01-03],two,three}/[01-03]bar 0 -120 foo[2-13]/[01-12]bar 0 -120 foo[2-13]/[1-12]bar 0 -90 foo[2-13]/[1-9]bar 0 101 foo[0000-100]bar 0 41 foo[0000-40]bar 0 -10 foo[8-19]bar 0 -399 foo[600-1000]bar 1 0 foo{011-047}bar [download]	[reply] [d/l] [select]
Re: Re: Possibly long lists and memory leaks by waswas-fng (Curate) on Dec 03, 2002 at 19:23 UTC
Instead of limiting the number of items and size, you may also be able to tie an array to a file -- this way it does not force it to load the whole array into memory and you dont have to set as much of an artifical limit on the input. -Waswas	[reply]