Without the code, you're pretty much going to get generalities. Having said that, the most likely culprit that comes to mind given the context is HTML::TreeBuilder, if you're using it to do the analysis. Its trees contain circular references, which won't get garbage collected correctly unless you call the delete method on the instance.
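For what it's worth, a minimal sketch of that pattern (the @html_files list and the extraction step are placeholders for whatever you're actually doing):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    for my $file (@html_files) {   # hypothetical list of pages to analyze
        my $tree = HTML::TreeBuilder->new_from_file($file);

        # ... extract whatever you need from $tree here ...

        $tree->delete;   # breaks the parent/child reference cycles so
                         # the tree's memory can actually be reclaimed
    }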
There could be circular references that the garbage collector can't reclaim. Maybe Devel::Cycle can help you find them.
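If you want to try it, find_cycle takes a reference and reports any cycles it can reach from it; a contrived example:

    use strict;
    use warnings;
    use Devel::Cycle;

    # A deliberate self-reference, just to show the output
    my $node = {};
    $node->{self} = $node;

    find_cycle($node);   # prints a description of each cycle it finds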
I ran into a similar, seemingly unavoidable problem with memory consumption when I was facing a huge number of Excel files, and decided to use Spreadsheet::ParseExcel to normalize/condense/combine the data from all of them. For each new Excel file that I opened, read, processed and closed, the module just kept taking up more memory, instead of re-using the space that was allocated for a previous file.
I decided to do a work-around, whereby I would process files until some reliable event occurred (e.g. changing directory, because there were never too many files in a single folder), write a "checkpoint" file to indicate how far I had gotten in the overall list, and exit. On start-up, the script would read the checkpoint file to figure out which directory to do next. Then it was just a matter of putting the script in a shell loop, running it enough times to cover the whole set.
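The skeleton looked roughly like this (names like checkpoint.txt, get_directories and process_directory are placeholders, and it assumes the directory list comes back in a stable sorted order):

    use strict;
    use warnings;

    my $checkpoint = 'checkpoint.txt';
    my $last_done  = '';
    if (open my $in, '<', $checkpoint) {
        chomp($last_done = <$in> // '');
        close $in;
    }

    for my $dir (get_directories()) {       # stable sorted order assumed
        next if $last_done && $dir le $last_done;   # already processed
        process_directory($dir);   # open, read, process, close each file
        open my $out, '>', $checkpoint or die "checkpoint: $!";
        print $out "$dir\n";
        close $out;
        exit 0;   # exit so the OS reclaims all memory; the outer
                  # shell loop starts a fresh process for the next dir
    }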
In your case:
- Does the database provide info that you need in order to decide which web pages to get? If not, segregate the LWP/HTML::Parser part from the MySQL part -- those two parts don't need to be in the same script. The page-fetch script could just output a tab-delimited text file, which could be loaded to the database via LOAD DATA INFILE.
- If the page fetch does depend on stuff being fetched from the database, you should still split the LWP and HTML parsing out into a separate process that does just one page at a time, and run it as a child of the MySQL process at each iteration. In this case, a script that takes a URL as a command-line arg, and prints string data suitable for mysql insertion to its STDOUT, could be run via back-ticks or via open( PROC, "-|", $script_name, $url ), as sketched below.
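Something like this on the parent side (fetch_one_page.pl and the insert step are placeholders):

    use strict;
    use warnings;

    for my $url (@urls) {   # however you get them from the database
        open(my $proc, '-|', 'perl', 'fetch_one_page.pl', $url)
            or die "can't start fetcher for $url: $!";
        while (my $row = <$proc>) {
            chomp $row;
            # ... insert $row into MySQL here ...
        }
        close $proc;   # the child exits, and the OS reclaims every byte
                       # that LWP and the HTML parser chewed up
    }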
Either way, most of your trouble comes from trying to do too much in one huge monolithic script. Break it down into simpler components -- that's likely to improve performance in a lot of ways, and will make it easier to maintain; it's a win-win approach.
Thanks to all for the quick help. Below is my report on the question.
talexb, I suspected my Perl program was consuming the memory because I used 'free -m' to watch free memory: when I ran the Perl program, free memory decreased very quickly and was not released after the program stopped.
Fletch, you got the point. I forgot to delete the tree. Since I called HTML::TreeBuilder many times, that caused serious memory waste. After I deleted the tree, the memory leak was almost solved.
When I say 'almost', I mean there is still a very slow memory leak, on the order of 1 MB every few minutes. graff is right: the trouble comes from my large script (1305 lines :P). I should break the script into smaller components.
I didn't try Devel::Cycle or Test::Memory::Cycle, since I didn't have complex reference structures.
What evidence do you have to support the hypothesis that your Perl program is causing the memory leak? Is your program running as a daemon? Can you disable or mock up portions of your program and see whether the memory leak persists or goes away?
I almost never worry about undefing variables to free them up -- I just allow them to fall out of scope, and Perl does the rest.
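For instance (load_huge_structure is just a placeholder), a lexical that falls out of scope is freed without any help:

    {
        my $big = load_huge_structure();   # placeholder for real work
        # ... use $big ...
    }
    # $big's refcount hit zero at the closing brace, so Perl has
    # already reclaimed it -- no explicit undef required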
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
The only time that I was bitten by a memory leak in Perl was when constructing a recursive function out of an anonymous sub reference, which is/was addressed by Sub::Recursive.
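The classic shape of that leak, plus the usual fix with Scalar::Util's weaken if you don't want to pull in Sub::Recursive (factorial here is just for illustration):

    use strict;
    use warnings;
    use Scalar::Util 'weaken';

    # The leak: the closure captures $leaky, and $leaky holds the
    # closure -- a cycle that reference counting can never free.
    my $leaky;
    $leaky = sub {
        my ($n) = @_;
        return $n <= 1 ? 1 : $n * $leaky->($n - 1);
    };

    # The fix: keep one strong reference outside, and weaken the
    # copy that the closure captures.
    my $fact = do {
        my $self;
        $self = sub {
            my ($n) = @_;
            return $n <= 1 ? 1 : $n * $self->($n - 1);
        };
        my $strong = $self;   # keeps the sub alive
        weaken($self);        # breaks the cycle
        $strong;
    };

    print $fact->(5), "\n";   # 120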
Every other time I ran into something like this, it turned out to be a mistake of my own, or code I'd written that was worse than usual.
The best thing to do is to create the smallest, simplest snippet of code that demonstrates the leak. This would serve as your "evidence".