imagestrips has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have been trying to use WWW::Robot, but I seem to be facing a memory management problem. I am letting the robot traverse my web site (a large-ish one, > 10000 pages), and by the time it reaches ~3000 pages the memory reserved is nearly 1GB (!). I am using the standard way of invoking it as per its own documentation, i.e.
use WWW::Robot;

my $robot = WWW::Robot->new('NAME' => 'blabla', 'VERSION' => '123456', 'EMAIL' => 'me@home.com');
$robot->run($rootDocument);
...and I have defined the necessary hooks, which work successfully. I ran into the incompatibility with the latest version of WWW::RobotUA and dealt with it by hardcoding the RobotUA 'FROM' field to an email address (I know I shouldn't, but I am working on tight timelines :-( ). Does anybody have any experience using this module? Any ideas on a way forward?
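
For reference, the hook registration looks roughly like this (simplified: the 'invoke-on-contents' hook name and the callback argument order are my reading of the WWW::Robot documentation rather than a verbatim copy of my code, and my real callback does the actual indexing work):

$robot->addHook('invoke-on-contents', sub {
    my ($r, $hook, $url, $response, $contents) = @_;
    # index / record the fetched page here
});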

Replies are listed 'Best First'.
Re: WWW::Robot memory management
by talexb (Chancellor) on Jan 17, 2006 at 18:15 UTC

    I don't have experience with this module, but I'm curious as to why you need to look at the entire site in one shot. Is there any way you could break it up and search separate sections of the site, possibly combining the results after all of the searches have completed?
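
    Something along these lines might let you crawl one section per run and merge the results afterwards. Since I haven't used WWW::Robot, the 'follow-url-test' hook name and its callback arguments are only my guess from a quick look at the documentation, and the section URL is just a placeholder:

    my $section = 'http://www.example.com/products/';   # hypothetical section root

    # assuming $robot is the robot object from your snippet above
    $robot->addHook('follow-url-test', sub {
        my ($r, $hook, $url) = @_;
        return index("$url", $section) == 0;    # true => follow this URL
    });

    $robot->run($section);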

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Indeed, I have tried that approach too, by using the save-state and restore-state hooks, but it failed because:
      a) I think I may have misunderstood what constitutes the state of the machine (the lists of URLs visited and URLs still to be visited?);
      b) the methods available do not provide a comprehensive interface to the robot's state; and
      c) I am dealing with a Domino R5 site that does not exactly lend itself to rational partitioning: it has very unfriendly and seemingly unstructured URLs, so I have not been able to devise an adequate segmentation scheme.
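
      For what it's worth, this is roughly what I tried (simplified). Because I could not find accessors for the robot's internal URL lists, the visited-URL bookkeeping here is my own, kept outside the robot; the 'invoke-on-followed-url' hook name, the save-state callback arguments, and the state file name are assumptions from my reading of the documentation, and may well be where I went wrong:

      use Storable qw(store retrieve);

      my $state_file = 'robot-state.sto';                  # my own state file, not the module's
      my %seen = -e $state_file ? %{ retrieve($state_file) } : ();

      # record every URL the robot actually follows
      $robot->addHook('invoke-on-followed-url', sub {
          my ($r, $hook, $url) = @_;
          $seen{"$url"} = 1;
      });

      # dump my own bookkeeping when the robot fires its save-state hook
      $robot->addHook('save-state', sub {
          store(\%seen, $state_file);
      });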