I ran into a similar, seemingly unavoidable memory-consumption problem when I had a huge number of Excel files to handle and decided to use Spreadsheet::ParseExcel to normalize, condense and combine the data from all of them. For each new Excel file that I opened, read, processed and closed, the module just kept taking up more memory instead of re-using the space that had been allocated for the previous file.
I settled on a work-around: the script would process files until some reliable event occurred (e.g. a change of directory, because there were never too many files in a single folder), write a "checkpoint" file to record how far it had gotten in the overall list, and exit. Since each run is a fresh process, whatever memory the module had accumulated is released when the script exits. On start-up, the script would read the checkpoint file to figure out which directory to do next. Then it was just a matter of putting the script in a shell loop and running it enough times to cover the whole set.
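A minimal sketch of that checkpoint idea, not the script I actually used. The file name checkpoint.txt, the data/* directory layout and the process_dir() placeholder are all assumptions made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical names throughout: checkpoint.txt, data/*, process_dir().
exit( main() ) unless caller;

sub main {
    my $checkpoint = 'checkpoint.txt';
    my @dirs       = sort grep { -d } glob('data/*');

    my $done = read_checkpoint($checkpoint);
    return 1 if $done >= @dirs;    # nothing left: non-zero status ends the shell loop

    process_dir( $dirs[$done] );   # one directory per run, then exit
    write_checkpoint( $checkpoint, $done + 1 );
    return 0;
}

# Number of directories finished by earlier runs (0 if no checkpoint yet).
sub read_checkpoint {
    my ($file) = @_;
    open my $in, '<', $file or return 0;
    my $line = <$in>;
    close $in;
    return ( defined $line && $line =~ /^(\d+)/ ) ? $1 : 0;
}

# Record how many directories are finished.
sub write_checkpoint {
    my ( $file, $count ) = @_;
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} "$count\n";
    close $out or die "Cannot close $file: $!";
}

# Placeholder for the real per-directory work (e.g. the Excel parsing).
sub process_dir {
    my ($dir) = @_;
    print "processing $dir\n";
}

1;
```

Saved as, say, one_dir.pl (again a made-up name), it can be driven from a Bourne-style shell with `while perl one_dir.pl; do :; done`, which stops as soon as the script exits with a non-zero status.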
In your case:
- Does the database provide info that you need in order to decide which web pages to get? If not, segregate the LWP/HTML::Parser part from the MySQL part -- those two parts don't need to be in the same script. The page-fetch script could just output a tab-delimited text file, which could then be loaded into the database via LOAD DATA INFILE.
- If the page fetch does depend on stuff being fetched from the database, you should still move the LWP and HTML parsing into a separate process that just does one page at a time, and run it as a child of the script that talks to MySQL at each iteration. In this case, a script that takes a URL as a command-line arg, and prints string data suitable for MySQL insertion to its STDOUT, could be run via back-ticks or via open( PROC, "-|", $script_name, $url );
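Here is a rough sketch of the parent side of that second option. It assumes a hypothetical child script (called fetch_page.pl below) that prints one tab-delimited row per line; the sub name fetch_rows and the row format are my own inventions for illustration:

```perl
use strict;
use warnings;

# Run the child script on one URL and collect its tab-delimited output.
# Each child is a separate process, so its memory goes away when it exits.
sub fetch_rows {
    my ( $script, $url ) = @_;

    # List-form piped open: no shell is involved, so the URL needs no quoting.
    open my $proc, '-|', $script, $url
        or die "Cannot start $script: $!";

    my @rows;
    while ( my $line = <$proc> ) {
        chomp $line;
        push @rows, [ split /\t/, $line, -1 ];    # -1 keeps trailing empty fields
    }

    # close() returns false if the child exited with a non-zero status.
    close $proc or warn "$script failed for $url (status $?)\n";
    return @rows;
}

1;
```

In the database script, the loop body would then be something like `for my $row ( fetch_rows( './fetch_page.pl', $url ) ) { $sth->execute(@$row) }`, where $sth stands for whatever prepared DBI insert statement you already have.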
Either way, most of your trouble comes from trying to do too much in one huge monolithic script. Break it down into simpler components -- that's likely to improve performance in a lot of ways, and will make it easier to maintain; it's a win-win approach.