Godsrock37 has asked for the wisdom of the Perl Monks concerning the following question:

I'm having an issue keeping down the memory usage of a web spider I wrote. It runs fine without any issues or bugs, except that after a few hours the computer runs out of memory and I get an error that there isn't enough available memory to allocate what I need. I'm running on Windows XP with 4 GB of RAM, and watching Task Manager shows that memory usage steadily climbs.

I'm using threading to keep the program running at a decent speed (20 threads), and I also spawn threads that detach to do MySQL statements.

I've looked everywhere for help on this issue, including here, and seen that there were bug fixes in past versions of Perl, but it seems to be a persistent issue.

I'm using Perl 5.10 and threads 1.67.

I've told threads to make the stack size 4096, and I've even explicitly set the detached MySQL thread variables to undef when they're done. The final output is as follows:

Out of memory!
Callback called exit at C:/perl/lib/HTML/Element.pm line 234.
Callback called exit at C:/perl/lib/HTML/Element.pm line 234.
Perl exited with active threads:
    19 running and unjoined
    0 finished and unjoined
    2 running and detached

The full program is 700 lines, so I've only posted how I spawn the threads:

use threads ('stack_size' => 4096);
use threads::shared;
use threads qw(yield);
use Thread::Queue;

my $sched = new Thread::Queue;

my $parser_thread1 = async { main_loop(); };
my $parser_thread2 = async { main_loop(); };
my $parser_thread3 = async { main_loop(); };
# ...goes to 20, not elegant but it works

$parser_thread1->join;
$parser_thread2->join;
$parser_thread3->join;
# ...goes to 20 as well

sub main_loop {
    while( schedule_count()
           and $hit_count < $hit_limit    # could be commented out
           and time() < $expiration
           and ! $QUIT_NOW )
    {
        yield();
        process_url( next_scheduled_url() );
    }
    return;
}

my $mysql_thread = threads->create('send_parts_to_db', $parts);
$mysql_thread->detach;
undef $mysql_thread;    # as suggested by someone on perlmonks

As shown, I use the same $mysql_thread variable for all MySQL insertions... but if I ever get an error, the thread number can be 170 or higher. Is that normal?

Re: Threads memory consumption is infinite
by BrowserUk (Patriarch) on Jun 10, 2008 at 15:22 UTC

    Based purely upon your description, it sounds like your mysql threads are failing to get cleaned up.

    In general, using multiple threads with DBI has been a no-no, because the third-party (vendor) libraries that underlie many DBI/DBD implementations are not thread-safe and allocate resources on a per-process basis.

    You can in most cases successfully use DBI from a multi-threaded app, but the safest way to do so is to start a single, long-running thread that conducts all the interactions with the DB. When other threads in your application need to make DB calls, they should communicate their requirements to, and retrieve results from, the single DBI thread via queues or other shared-memory constructs. A minimal sketch of the idea follows.
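
    Something along these lines (a sketch only; the DSN, credentials, table, and column names are placeholders, and the queue protocol here is the simplest possible "undef means shut down"):

        use threads;
        use Thread::Queue;
        use DBI;

        my $db_queue = Thread::Queue->new;

        # The one and only thread that ever touches DBI.
        my $db_thread = threads->create( sub {
            my $dbh = DBI->connect( 'dbi:mysql:parts_db', 'user', 'pass',
                                    { RaiseError => 1 } );
            my $sth = $dbh->prepare('INSERT INTO parts (part) VALUES (?)');
            while ( defined( my $part = $db_queue->dequeue ) ) {
                $sth->execute($part);     # dequeue blocks until work arrives
            }
            $dbh->disconnect;             # an undef item ends the loop
        } );

        # Worker threads never touch DBI; they just enqueue:
        #   $db_queue->enqueue($part);
        # At shutdown:
        #   $db_queue->enqueue(undef);
        #   $db_thread->join;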

    The best way to test this hypothesis would be to run a copy of your app that does everything it does now, including spawning and ending the "DBI" threads, but comment out/remove the actual DBI code. If the app runs without accumulating dead threads once they are no longer actually making DBI connections, that's a fairly clear indication that DBI, or the underlying vendor libraries, are failing to clean up and release resources.

    Beyond that test, the best approach to solving the problem is to create a vastly cut-down version of your app that 'goes through the motions' without actually doing too much--reducing the code to the bare minimum that demonstrates the problem--and then post that here so that we can advise further. It may also allow the raising of a bug report that might allow someone to see and fix the problem.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I commented out the DBI thread creation part as recommended, and the problem persists. I also explicitly undefed two rather big hashes when I was done with them.

      It may be doing a little better, but not enough that it's a 'solution' per se. It just took longer to die, if there was any change at all.

Re: Threads memory consumption is infinite
by zentara (Cardinal) on Jun 10, 2008 at 16:31 UTC
    I'm just brainstorming here, from other threads experience. A thread gets a copy of the parent when it gets spawned, and this is the cause of all sorts of thread-safety difficulties.

    Now, just glancing at your pseudocode: you are creating up to 20 parser threads BEFORE you spawn the $mysql_thread, so the $mysql_thread gets a copy of all of it. Possibly you are getting recursion in parser thread creation also? What sequence would add up to 170? Does the second parser thread get a copy of the first, etc.?

    Maybe try to spawn your $mysql_thread BEFORE you create your 20 parser threads? Also, can you do

    $mysql_thread->kill('SIGUSR1');
    undef $mysql_thread;
    Possibly threads->exit can be used in the thread to ensure it returns, so it can close itself up.
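
    A sketch of that (do_one_unit_of_work is a hypothetical stand-in for the thread's real work):

        use threads ( 'exit' => 'threads_only' );   # exit() ends only the calling thread

        my $thr = async {
            while (1) {
                my $done = do_one_unit_of_work();   # hypothetical helper
                threads->exit() if $done;           # terminates just this thread
            }
        };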

    But like BrowserUk suggests, your best bet is to simplify it down to a testable example, without mysql involved, and see how it behaves.


    I'm not really a human, but I play one on earth CandyGram for Mongo

      Watching Task Manager, I see that the mysql threads do successfully die, and there are never more than 25 or so threads created (20 parsers, 1 main, and 2-4 mysql, which fluctuate depending on performance).

      I told it to die after about 15 minutes (all that it can last), and as each thread exits it frees up approximately 300-400 MB, which tells me that each parser thread is accumulating the memory, and not the mysql threads or even the main thread that spawns the parsers... hopefully that was a coherent thought.

        hopefully that was a coherent thought

        heh heh :-)


        I'm not really a human, but I play one on earth CandyGram for Mongo
Re: Threads memory consumption is infinite
by perrin (Chancellor) on Jun 10, 2008 at 17:31 UTC
    Are you using WWW::Mechanize? It keeps the full text of all pages it hits in memory. Instructions for disabling this are in the FAQ.
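
    For reference, if WWW::Mechanize were in play, the fix from the FAQ is to limit the page stack at construction time:

        use WWW::Mechanize;
        my $mech = WWW::Mechanize->new( stack_depth => 0 );   # keep no page history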
      I'm using HTML::TreeBuilder, which has a delete method that I've implemented.
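
      In case it helps others, the typical pattern (a sketch, assuming each page is parsed from a string) is:

          use HTML::TreeBuilder;

          my $tree = HTML::TreeBuilder->new_from_content($html);
          # ... extract whatever is needed from the tree ...
          $tree->delete;   # break the tree's self-referential structure so it can be freed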
Re: Threads memory consumption is infinite
by Godsrock37 (Sexton) on Jun 10, 2008 at 18:14 UTC

    I think I've deduced that it's in those 20 static threads... but I'm having trouble testing it.

    I need to be able to do something along the lines of the following:

    sub main_loop {
        my $parser_thread;
        while( schedule_count()
               and $hit_count < $hit_limit    # could be commented out
               and time() < $expiration
               and ! $QUIT_NOW
               and scalar threads->list(threads::running) <= 25 )
        {
            #yield();
            $parser_thread = async { process_url( next_scheduled_url() ); };
        }
        $parser_thread->join;
        return;
    }

    The problem is that it creates the 20 threads, then joins them, and that's the end... I want it to create 20 threads or so, and as one dies off it spawns a new one... any thoughts? I feel like this is really close.

      This is starting to ring a bell; I've seen this with Tk. You need to reuse your threads, because the refcounting is so complicated that Perl won't free the thread's memory when it's undefined. What I do in Tk-with-worker-threads is create 3 reusable worker threads (you would want to create 20). My example looks complex because I use a bunch of hash names, but you can simplify it for your purposes. What I essentially do is store the available threads in an array @ready. I shift off a thread as I need a thread to work, and when it is done working, I push it back onto the @ready array for next use. This way, only 20 threads ever get created, and only 20 max can run at a time. It works to conserve memory by reusing the threads. You only need to join the threads once, when exiting the program.

      So instead of spawning a new one as an old one dies off, push the dying one onto the @ready array, and shift one off for the next thread. Believe me, it works and is solid, and avoids the refcount problem. (To avoid confusion: when you get a worker, a little popup appears that spawns an xterm; just close the xterm to start the thread running. On win32, change the cmd to something that works. I did this to give a visual indication that the thread was actually doing something.) A queue-based sketch of the same idea is below.
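
      Here's a minimal sketch of the same "create the workers once, reuse them forever" idea, using a shared job queue instead of signals and the @ready array (process_url is the OP's sub; @start_urls is illustrative):

          use threads;
          use Thread::Queue;

          my $jobs = Thread::Queue->new;

          # Create the 20 workers exactly once; each blocks on dequeue
          # until a URL arrives, and exits when it dequeues undef.
          my @workers = map {
              threads->create( sub {
                  while ( defined( my $url = $jobs->dequeue ) ) {
                      process_url($url);
                  }
              } );
          } 1 .. 20;

          $jobs->enqueue($_) for @start_urls;    # feed work as it is discovered
          $jobs->enqueue(undef) for @workers;    # one shutdown token per worker
          $_->join for @workers;                 # join once, at program exit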


      I'm not really a human, but I play one on earth CandyGram for Mongo

        Quick question... how do I start/stop each thread?

        In other words: the thread finishes its job, and I will immediately have another job for it. I dunno, I like the @ready strategy and reusing the threads; that seems like exactly what I need, but I don't know how to implement it.

        I looked at your code but, like you said, it's a little more than I need. I don't need to share any data between threads except the queue, which already works.

        Can I send you a PM or an email or something? I love PerlMonks, but it's a little bit of a low-bandwidth form of communication.

Re: Threads memory consumption is infinite
by Godsrock37 (Sexton) on Jun 10, 2008 at 17:06 UTC

    Unfortunately, I can't post all of the code because of company policies on the code, etc.

    In the future I may be able to post a slimmed-down version, though I'm not sure how much it would help; I've supplied everything having to do with threading except for the queue, which is pretty straightforward. I think this is an applied-theory issue where I just misunderstood something about threading or how to manage memory.

Re: Threads memory consumption is infinite
by Godsrock37 (Sexton) on Jun 12, 2008 at 18:33 UTC

    I finished my solution to the problem:

    sub main_loop {
        my $parser_thread;
        while( still_running()
               #and $hit_count < $hit_limit    # could be commented out
               and time() < $expiration
               and ! $QUIT_NOW )
        {
            # if we have fewer than 23 (including mysql), make a new one
            if ( scalar threads->list(threads::running) < 23
                 and schedule_count() )
            {
                $parser_thread = async { process_url( next_scheduled_url() ); };
            }
            # clean up dead threads
            if ( threads->list(threads::joinable) ) {
                foreach $parser_thread ( threads->list(threads::joinable) ) {
                    $parser_thread->join;
                    undef $parser_thread;
                }
            }
        }
        return;
    }

    sub still_running {
        return 1 if schedule_count();
        return 1 if scalar threads->list(threads::running);
        return 0;
    }

    The leak is either non-existent or drastically reduced. Thanks for the help, everyone!