carthag has asked for the wisdom of the Perl Monks concerning the following question:

I'm (re)writing a module (and will eventually put it on CPAN, when the bugs are sorted out) that generates URL-lists based on user input, since the previous version was using up horrible amounts of memory (I've had to kill httpd processes with VSIZEs of over 2 gigs).

However, it did not help putting in a 'maximum' list-length, as I had expected. I tried letting it run over the night, and got the same problem again... Does anyone have an idea?

The module is used as follows:

my $uri = $q->param('uri')||""; my $uriobj = URI::Fusker->new(uri => $uri, maximum => 100); my @urls = $uriobj->as_list; #After which @urls is popped into a template, printed, etc

And here is the module code:

package URI::Fusker; #copyright 2001, 2002 carthagtuek@softhome.net ##################################################################### use strict; ##################################################################### require Exporter; our @ISA = qw(Exporter); our @EXPORT = qw(); our @EXPORT_OK = qw(length generateList generateSequences generateRa +nges); our %EXPORT_TAGS = ( All => [qw(length generateList generateSequences +generateRanges)] ); our $VERSION = 1.0; ##################################################################### sub new { my $class = shift; my $self = { uri => "", maximum => 10000, #Default maximum length for the gene +rated list @_ }; return bless $self, $class; } ##################################################################### sub as_list { my $self = shift; if ($self->length <= $self->{maximum}) { return generateList($self->{uri}); } else { return; #Todo: Make a shorter list. } } sub uri { my $self = shift; if (@_) { $self->{uri} = shift } return $self->{uri}; } sub maximum { my $self = shift; if (@_) { $self->{maximum} = shift } return $self->{maximum}; } ##################################################################### sub length { my ($self, $uri); my $length = 0; if (ref($_[0])) { $self = shift; $uri = $self->{uri}; } else { $uri = shift; } while ($uri =~ m/((?:\{[^\}]+\})|(?:\[[^\]]+\]))/ig) { if ($1 =~ m/\{([^\}]+)\}/i) { $length++; $length++ while ($1 =~ m/,/); } elsif ($1 =~ m/\[([^\]]+)\]/i) { my ($start, $stop) = sort split('-',$1); $length += $stop - $start + 1; } } return $length; } ##################################################################### sub generateList { my @urls = @_; while (grep /\{[^\}]+\}/, @urls) { #as long as there are {}s in the url list, split out the sequ +ences. @urls = generateSequences(@urls); } while (grep /\[[^\]]+\]/, @urls) { #as long as there are []s in the url list, split out the rang +es. @urls = generateRanges(@urls); } return @urls; } sub generateSequences { my @strings = (); while (@_) { my $string = shift @_; $string =~ /^([^\{]+)\{([^\}]+)\}(.*)$/; #something{somethi +ng}something my $begin = $1||''; my $mid = $2||''; my $end = $3||''; if ($mid) { #there were a set of {} my @items = split(',',$mid); foreach (@items) { #Taking each item and putting it in the list. push @strings, "$begin$_$end"; } } else { #there weren't, let's just push the string back out again push @strings, $string; } } return @strings } sub generateRanges { my @strings = (); while (@_) { my $string = shift @_; $string =~ /^([^\[]+)\[([^\]]+)\](.*)$/; #something[somethi +ng]something my $begin = $1||''; my $mid = $2||''; my $end = $3||''; if ($mid) { #there were a set of [] my ($start, $stop, undef) = sort split('-',$mid); my $digits = 0; if ($start =~ /\d+/) { #There is one or more digit, so we'll remove anything + else that might be there $digits = length($start); $start =~ s/\D+//g; $stop =~ s/\D+//g; } #Catching crap input here: $stop = substr("000000$stop",-length($start)) if ($stop =~ + /^0/); #Make sure they are the same length, eh foreach ($start .. $stop) { #Taking each item and putting it in the list. push @strings, "$begin".("0"x($digits-length($_)))."$_ +$end"; } } else { #there weren't, let's just push the string back out again push @strings, $string; } } return @strings } ##################################################################### 1; __END__ =head1 NAME URI::Fusker - URL generation based on ranges and sequences =head1 SYNOPSIS $uri = URI::Fusker->new("http://www.server.com/directory/{a,b,c}/[1-1 +0].html"); @urls = $uri->as_list; =head1 DESCRIPTION This module emulates the part of the behaviour of the commandline tool + cURL. In particular, it makes possible URL generation based on ranges and sequences as foll +ows: =over 3 =item * An URL containing a range like [1-10] will be expanded to a number of +URLs, each with a number from the range. =item * An URL containing a sequence like {a,b,c} will be expanded into a numb +er of URLs, each with an item from the sequence. =back Thus, in the example seen in L</SYNOPSIS>, 30 URLs will be generated. =head2 Methods =over 3 =item new( [uri => ""], [maximum => 10000] ) Creates a new URI::Fusker object. The uri parameter is the uri from wh +ich to generate the list, and the maximum parameter is the maximum length (the default + is 10,000) of said list. =item $uri->as_list Creates the list using the associated uri, unless the list will be lon +ger than the set maximum. =item $uri->uri([$uri]) Sets or returns the uri for the object. =item $uri->maximum([$max]) Sets or returns the maximum length of the object. =item $uri->length Will attempt to calculate the length of the list that is to be generat +ed. =item generateList($uri) Will generate the list regardless of the length. Be careful, this can +be very memory intensive. It is suggested to only use L</as_list>. =head1 AUTHOR Mikkel Eriksen, carthagtuek@softhome.net =head1 SEE ALSO http://curl.haxx.se/ =head1 COPYRIGHT Copyright 2002 Mikkel Eriksen. This program is free software. You may copy or redistribute it under t +he same terms as Perl itself. =cut

Replies are listed 'Best First'.
Re: Possibly long lists and memory leaks
by carthag (Scribe) on Dec 03, 2002 at 16:51 UTC
    With thanks to tye and jdporter, the length method has been rewritten as follows:
    sub length { my ($self, $uri); my $length = 1; if (ref($_[0])) { $self = shift; $uri = $self->{uri}; } else { $uri = shift; } while ($uri =~ m/((?:\{[^\}]+\})|(?:\[[^\]]+\]))/ig) { if ($1 =~ m/\{([^\}]+)\}/i) { my $temp = $1; $length *= ($temp =~ tr/,/,/); } elsif ($1 =~ m/\[([^\]]+)\]/i) { my ($start, $stop) = sort split('-',$1); $stop = $start+10 unless ($stop); if ($start =~ /\d+/) { #There is one or more digit, so we'll remove anything + else that might be there $start =~ s/\D+//g; $stop =~ s/\D+//g; } $length *= $stop - $start + 1; } } return $length; }
    However, this still isn't entirely perfect. These are some examples from my current log that generate inconsistent results:
    scalar($uriobj->as_list) $uriobj->length $uri 1 0 foo{011-047}bar 0 -398 foo[800-1199]bar 0 -348 foo[850-1199]bar 0 -298 foo[900-1199]bar 0 50 foo[001-50]bar 0 -97 foo[2-100]bar 0 -97 foo[2-100]bar 0 -7 foo[2-10]bar 1 0 foo{001-188}bar 0 -140 foo[01-20]bar 0 -3 foo[1-05]bar 0 -46 foo[53-100]bar 975 -23 foo[001+-+025]bar 9 6 foo{one,two,three}/[01-03]bar 27 6 foo{one[01-03],two,three}/[01-03]bar 0 -120 foo[2-13]/[01-12]bar 0 -120 foo[2-13]/[1-12]bar 0 -90 foo[2-13]/[1-9]bar 0 101 foo[0000-100]bar 0 41 foo[0000-40]bar 0 -10 foo[8-19]bar 0 -399 foo[600-1000]bar 1 0 foo{011-047}bar
      Instead of limiting the number of items and size, you may also be able to tie an array to a file -- this way it does not force it to load the whole array into memory and you dont have to set as much of an artifical limit on the input.

      -Waswas