in reply to Controlling file size when writing

Many logging modules, such as Log::Log4perl, include provisions to rotate logs when they reach a certain size.
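For example, a minimal Log::Log4perl setup along these lines rotates by size. This is only a sketch: it assumes the Log::Dispatch::FileRotate appender from CPAN, and the file name and limits are placeholders.

    use Log::Log4perl;

    # 10 MB per file, keep the last five rotated copies.
    my $conf = q(
        log4perl.rootLogger                  = INFO, Rotating
        log4perl.appender.Rotating           = Log::Dispatch::FileRotate
        log4perl.appender.Rotating.filename  = spider.log
        log4perl.appender.Rotating.mode      = append
        log4perl.appender.Rotating.size      = 10485760
        log4perl.appender.Rotating.max       = 5
        log4perl.appender.Rotating.layout    = Log::Log4perl::Layout::PatternLayout
        log4perl.appender.Rotating.layout.ConversionPattern = %d %p %m%n
    );
    Log::Log4perl->init( \$conf );
    my $logger = Log::Log4perl->get_logger();
    $logger->info("spider started");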

You say running the script for long periods results in large log files ... for what value of "long period"? If it's several weeks, the normal practice of rotating logs daily and keeping the last four or five days will be quite sufficient and more economical. If "long periods" means more than ten minutes, perhaps you should re-evaluate what is logged and what is not. Obviously you need different levels of logging when things are running well than when you are tracking down a bug; I suggest using a logging module.
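One low-tech way to get those two modes, sketched here with Log::Log4perl's :easy interface, is to pick the verbosity at startup. SPIDER_DEBUG is a made-up environment variable, not anything standard:

    use Log::Log4perl qw(:easy);

    # Quiet by default; full detail only when explicitly requested.
    my $level = $ENV{SPIDER_DEBUG} ? $DEBUG : $WARN;
    Log::Log4perl->easy_init({ level => $level, file => '>> spider.log' });

    WARN  "could not fetch page";       # logged on every run
    DEBUG "raw response follows ...";   # logged only when SPIDER_DEBUG is set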

Update: Have the server log whether each URL was successfully processed or not. Then you can go back and run the script manually with mega- mega- mega-MEGA debugging detail on the problem URL ... while the ordinary instances generate only a few MB. You might have problem URLs saved to a special file, or the script could send you an email with the details.
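Something along these lines would do it. A rough sketch only: process_url(), @urls and the file name problem_urls.txt are stand-ins for your own code.

    use strict;
    use warnings;
    use Log::Log4perl qw(:easy);

    Log::Log4perl->easy_init($INFO);

    # Placeholders for the real spider:
    sub process_url { my ($url) = @_; die "placeholder failure\n" if $url =~ /bad/ }
    my @urls = ('http://example.com/ok', 'http://example.com/bad');

    open my $bad, '>>', 'problem_urls.txt' or die "problem_urls.txt: $!";
    for my $url (@urls) {
        if ( eval { process_url($url); 1 } ) {
            INFO "OK $url";
        }
        else {
            WARN "FAILED $url: $@";
            print {$bad} "$url\n";   # revisit later with full debugging turned on
        }
    }
    close $bad;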

--
TTTATCGGTCGTTATATAGATGTTTGCA

Re: Re: Controlling file size when writing
by Vautrin (Hermit) on Mar 02, 2004 at 22:45 UTC
    Well, the process is multithreaded. So even though over a day I might generate 100 MB at most at the top logging level, after 40-50 forks we're talking about 4-5 GB a day. This has compounded the problem, because I'm trying to keep the logs sorted and rotated. And even though I can turn down the detail, the bugs I am finding require a high level of detail for testing.

    (The script is a web spider. Most of the bugs I encounter with it involve bizarre / broken HTML in web pages. The problem is that in order to figure out just what is going on, I want to log lots of info whenever there are anomalies. The question becomes how to do that without being too processor-intensive.)


      And this is why you need production and dev boxes. On the production box, you only need enough logging to know when something breaks. On your dev box, you can re-harvest the offending pages and see the errors in all their glory, including the ability to add instrumentation to the code on the fly.

      If you cannot have the two boxes, perhaps you can run a second instance of your spider manually, when needed. I bet this is less resource-intensive than managing such huge logs.
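      A manual debugging run could be as small as this sketch, which re-fetches the flagged URLs (problem_urls.txt is the hypothetical hand-off file mentioned above) with logging turned all the way up:

        use strict;
        use warnings;
        use LWP::UserAgent;
        use Log::Log4perl qw(:easy);

        # Everything goes to a throwaway debug log at full detail.
        Log::Log4perl->easy_init({ level => $DEBUG, file => '>> debug_run.log' });

        my $ua = LWP::UserAgent->new;
        open my $fh, '<', 'problem_urls.txt' or die "problem_urls.txt: $!";
        while ( my $url = <$fh> ) {
            chomp $url;
            my $res = $ua->get($url);
            DEBUG "status for $url: " . $res->status_line;
            DEBUG "content:\n" . $res->content if $res->is_success;
        }
        close $fh;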

      That being said, take a look at Logfile::Rotate.
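      If I remember its interface right, using it is roughly this simple; the file name and Count are placeholders, so check the module's docs for the exact options:

        use Logfile::Rotate;

        my $rotator = Logfile::Rotate->new(
            File  => '/var/log/spider.log',
            Count => 7,        # keep the last seven rotated copies
            Gzip  => 'lib',    # compress with Compress::Zlib
        );
        $rotator->rotate();

      Run it from cron, or at the end of each spider pass, so the live log never grows past one rotation's worth.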

      Best regards

      -lem, but some call me fokat