cwchang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've been using WWW::Mechanize 0.71 and keep running into a memory leak problem. Here is an example script:
#!/usr/bin/perl
use WWW::Mechanize;
use Devel::Size qw(size total_size);

my $agent = WWW::Mechanize->new();

while (1) {
    $agent->get(qq(http://www.yahoo.com));
    print "agent size = ".total_size($agent)."\n";
}
This script shows that the size of the agent object keeps growing quickly. The same can be verified using top.
Changing the object type from WWW::Mechanize to LWP::UserAgent immediately fixes the leak, but of course loses all the wonderful features Mechanize provides.
I've also tried the latest 0.71_2. Still leaks. I'm using RH9.0 with Perl 5.8.0 on my Linux box. LWP is version 5.76.
I would very much appreciate any insight from the monks. Thanks in advance.

Replies are listed 'Best First'.
Re: WWW::Mechanize memory leak???
by ViceRaid (Chaplain) on Jan 07, 2004 at 17:41 UTC

    Afternoon

    Yeah, I get the same results (perl, v5.8.0 built for i386-linux-thread-multi; WWW::Mechanize 0.70). Like Roy Johnson suggested above, it's because WWW::Mechanize keeps a list of HTTP results in a page stack. Whenever it starts to get a new page, it stores the last response it received in an array.

    If this is a problem for you - for example, if you've got a long running process and it's getting too fat - you should create a subclass of WWW::Mechanize that keeps a limit on the size of the page stack, perhaps by redefining the _push_page_stack method:

    package WWW::Mechanize::KeepSlim;
    our @ISA = qw/WWW::Mechanize/;

    sub _push_page_stack {
        my $self = shift;
        if ( $self->{res} ) {
            my $save_stack = $self->{page_stack};
            $self->{page_stack} = [];
            push( @$save_stack, $self->clone );

            # HERE! - stop the stack getting bigger than 10
            if ( @$save_stack > 10 ) {
                shift(@$save_stack);
            }
            $self->{page_stack} = $save_stack;
        }
        return 1;
    }

    package main;
    my $agent = WWW::Mechanize::KeepSlim->new();
    # ....

    If you use this class with your example that demonstrates the problem, you should see the memory usage increase arithmetically for the first 10 requests, then stop increasing.

    cheers
    ViceRaid

      It might be more useful to remove elements based on their age. The frequency with which you access the web is not the same throughout the day, and when you are busy surfing, you don't want the history cleaned up any faster than when you are mostly idle.

      You could modify the structure of $self->{page_stack} a little, so that a time stamp is kept with each entry and only entries older than a certain age get deleted.

      However, since you are subclassing anyway, it is probably a better idea to keep $self->{page_stack} as it is and add a new array, $self->{page_time_stamp}, whose elements match the page stack one-to-one.

      Deletion performance would be good, as the entries that need to be deleted always sit together at the beginning of the array. A sketch of that idea follows below.
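
      A minimal sketch, following the subclass pattern above (the class name, the $MAX_AGE cutoff, and the page_time_stamp field are illustrative, not part of WWW::Mechanize):

      package WWW::Mechanize::TimedStack;
      use strict;
      use warnings;
      our @ISA = qw(WWW::Mechanize);

      # Prune history entries older than an hour (arbitrary cutoff)
      my $MAX_AGE = 3600;

      sub _push_page_stack {
          my $self = shift;
          if ( $self->{res} ) {
              my $save_stack = $self->{page_stack};
              $self->{page_stack} = [];
              push @$save_stack, $self->clone;
              push @{ $self->{page_time_stamp} }, time;

              # Old entries always sit at the front of both arrays, so
              # shift until we reach one young enough to keep.
              while ( @{ $self->{page_time_stamp} }
                  && time - $self->{page_time_stamp}[0] > $MAX_AGE )
              {
                  shift @$save_stack;
                  shift @{ $self->{page_time_stamp} };
              }
              $self->{page_stack} = $save_stack;
          }
          return 1;
      }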

      For future reference: You can set the max. stack depth with $mech->stack_depth(100) now:
      =head2 $mech->stack_depth( $max_depth )

      Get or set the page stack depth. Use this if you're doing a lot of page
      scraping and running out of memory.

      A value of 0 means "no history at all." By default, the max stack depth
      is humongously large, effectively keeping all history.
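
      So with a recent WWW::Mechanize, the subclass above is no longer needed; a sketch of the one-line change applied to the original script (the limit of 10 is arbitrary):

      my $agent = WWW::Mechanize->new();
      $agent->stack_depth(10);    # keep at most 10 pages of history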
      Hi, ViceRaid,

      Thank you very much for your explanation and code. It worked. I really appreciate it.

      CW

Re: WWW::Mechanize memory leak???
by Roy Johnson (Monsignor) on Jan 07, 2004 at 16:55 UTC
    I cannot replicate your problem on ActivePerl 5.8.0, with WWW-Mechanize 0.48 installed. I had thought it might be the fact that Mechanize keeps a history of where you've been, but I don't know what kind of growth you're looking at.
    agent size = 110364
    agent size = 110537
    agent size = 110537
    agent size = 110537
    agent size = 110537
    agent size = 110537
    agent size = 110537
    agent size = 110537

    The PerlMonk tr/// Advocate

      Right guess - the history feature was causing the problem, but from the Changelog for WWW::Mechanize it looks like the page-stack feature may have been broken in 0.48 (see the entry for 0.59).

      cheers
      ViceRaid

Re: WWW::Mechanize memory leak???
by RMGir (Prior) on Jan 07, 2004 at 18:39 UTC
    Assuming it's a history issue, could you rework your script to create a new agent for every loop?

    i.e.:

    #!/usr/bin/perl
    use WWW::Mechanize;
    use Devel::Size qw(size total_size);

    while (1) {
        my $agent = WWW::Mechanize->new();
        $agent->get(qq(http://www.yahoo.com));
        print "agent size = ".total_size($agent)."\n";
    }
    I admit, I have no idea how fast or slow the WWW::Mechanize constructor is.
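
    If construction cost is a concern, the core Benchmark module could measure it; a quick sketch (the iteration count is arbitrary):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethis);
    use WWW::Mechanize;

    # Time 1000 constructions of a fresh agent; no network traffic involved
    timethis( 1000, sub { my $agent = WWW::Mechanize->new() } );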

    Mike
Re: WWW::Mechanize memory leak???
by Steve_p (Priest) on Jan 09, 2004 at 03:27 UTC

    I don't believe it is leaking memory. Here's a quote from the perldoc for WWW::Mechanize:

    Mech also stores a history of the URLs you've visited, which can be queried and revisited.
    Also, here's the output I get from your program.

    agent size = 123086
    agent size = 245249
    agent size = 367383
    agent size = 489517
    agent size = 611651
    

    This shows me that the size is growing by an almost constant amount on each loop. I then modified your program to use Data::Dumper and dump the results after five GETs. Here's the code:

    #!/usr/bin/perl -w

    use strict;
    use WWW::Mechanize;
    use Devel::Size qw(size total_size);
    use Data::Dumper;

    my $agent = WWW::Mechanize->new();

    for(my $i = 0; $i < 5; $i++) {
        $agent->get(qq(http://www.yahoo.com));
        print "agent size = ".total_size($agent)."\n";
    }

    print Dumper($agent);

    Looking at the output, there is a data structure in the module called the page_stack. I'm guessing that the implementation of the back() method uses the page_stack: that way, rather than re-requesting pages, they are simply reloaded from memory. I don't think this is a leak; it just appears to be the intended functionality of the module.
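
    For instance, a short sketch of how that history gets used (the second URL is just a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $agent = WWW::Mechanize->new();
    $agent->get(qq(http://www.yahoo.com));
    $agent->get(qq(http://news.yahoo.com));

    # back() restores the previous page from the page_stack in memory,
    # without issuing another request over the network
    $agent->back();
    print $agent->uri, "\n";    # prints the first URL again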