gatito has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a very simple multi-threaded spider that can go out and fetch links from a number of pages at once. My basic single-threaded script works fine (everything below the subroutine comment) when it is its own script.

However, when I call it as a job from a thread pool, I can't get a response via LWP::UserAgent on my requests. I originally abstracted away the subroutine stuff as its own package, but that didn't work either (I thought that might help if there is a scoping error).

Given that I have no experience with multithreading other than playing around with a few of the modules and making threads print stuff concurrently, am I doing something mind-numbingly stupid or is there something worse going on?

Thanks for advice in advance.
#!c:/usr/bin/perl

# CPAN modules
use Thread::Pool;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

$pool = Thread::Pool->new( {
    optimize   => 'cpu',           # default: memory
    do         => 'XXX',           # must have
    checkpoint => \&checkpoint,
    frequency  => 1000,
    workers    => 5,               # default: 1
    maxjobs    => 50,              # default: 5 * workers
    minjobs    => 5,               # default: maxjobs / 2
} );

# Start up some threads
$pool->job( "http://www.yahoo.com" );
$pool->job( "http://www.eroticfalconry.com" );
$pool->join;    # wait for all threads to finish

#### subroutines

sub XXX {
    $current_url = @_[0];
    @queue = ();
    while ( $current_url ne "" ) {
        @links = get( "$current_url" );
        print "links in url\n********************\n";
        print join( "\n", @links ), "\n";
        push( @queue, @links );
        #print "links in queue\n********************\n";
        #print join("\n", @queue), "\n";
        $current_url = pop( @queue );
        print ">>>>new url=$current_url\n";    # this is where we go next
    }
}

my $p = HTML::LinkExtor->new( \&callback );
my @links = ();

# Set up a callback that collects links
sub callback {
    my ( $tag, %attr ) = @_;
    return if $tag ne 'a';    # we only look closer at <a ...>
    push( @links, values %attr );
}

sub get {
    $url = "@_[0]";
    print "Entering html::getlinks, url=$url\n";
    my $ua = LWP::UserAgent->new;
    $res = $ua->request( HTTP::Request->new( GET => $url ),
                         sub { $p->parse( $_[0] ) } );
    $base = $res->base;
    # Expand all URLs to absolute ones
    @links = map { $_ = url( $_, $base )->abs; } @links;
    return @links;
}

$version = 1;

Replies are listed 'Best First'.
Re: Threads and LWP::UserAgent question
by ikegami (Patriarch) on Jun 09, 2008 at 05:14 UTC

    I can't get a response via LWP::UserAgent on my requests.

    What does that mean? I'm under the impression that request returns a response for both successes and failures.

    And how did you ascertain this conclusion?

    I thought that might help if there is a scoping error

    You have tons of those. use strict; will reveal them.

    My basic single threaded script works fine

    A quick look at LWP's and HTML-Parser's documentation shows no indication that they are thread-safe.

    LWP appears to be completely written in Perl, so it's *probably* safe.

    HTML-Parser, on the other hand, is partially written in C. It's much more likely for a C module to not be thread-safe (since variables are shared by all threads). What happens if you remove all references to HTML::LinkExtor (temporarily)?
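    To run that experiment, a crude pure-Perl link extractor can stand in for HTML::LinkExtor while testing. This is only an illustration for isolating the problem, not a real substitute: parsing HTML with a regex is fragile, and the sample markup below is made up.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Crude, pure-Perl link extraction -- no HTML::Parser XS code involved.
    # Only good enough to test whether HTML::LinkExtor is the thread-safety
    # culprit; a regex is NOT a real substitute for an HTML parser.
    sub extract_hrefs {
        my ($html) = @_;
        return $html =~ /<a\s[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
    }

    my @links = extract_hrefs(
        '<a href="http://example.com/">x</a> <p>no</p> <a href="/rel">y</a>'
    );
    print join( "\n", @links ), "\n";
    ```

    If the spider works with this in place of HTML::LinkExtor, the finger points at the XS parser; if it still fails, the problem is elsewhere.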

      I can't get a response via LWP::UserAgent on my requests.

      I should have been a bit clearer. With a single-threaded app I have no trouble getting a response object and then getting its content (the HTML of the page). In the multi-threaded case I wasn't able to get the HTML at all, and after toying around with my code I got a 400 error to appear.

Re: Threads and LWP::UserAgent question
by moritz (Cardinal) on Jun 09, 2008 at 04:53 UTC
    Two things came to my mind while reading your question:
    1. Did you try LWP::Parallel? It might be just the right tool for your job
    2. When working with threads, try to keep the number of global variables minimal; that is, declare all variables with my where possible. Using strict helps you to detect cases where you forgot some declarations. Global variables have to be cloned for each thread, which consumes quite some memory.

    Apart from that, using use strict; use warnings; also helps avoiding mistakes.

      When working with threads, try to keep the number of global variables minimal

      Why? While that's good advice in general — it's easier to make sure that local variables are in the proper state than it is for global variables — I don't see why it's important for threads specifically. I can see this being an issue for *shared* variables (since their access needs to be controlled by locks), but other package, lexical, global and local variables are per-thread by default.

        Why?

        Because global variables have to be copied for each created thread. Admittedly it's not such a big problem as with shared variables, but if you're careless they can accumulate to quite some size.

        One of the reasons that perl threads aren't exactly lightweight is that there are just too many global variables (think $_ $/ $\ $, $; @ARGV  @INC ...) which all have to be copied for each created thread. And it's one of the reasons why Perl 6 tries to reduce the number of global variables wherever possible, and invents context variables instead.

        Specifically @links (in the root node) looks weird to me - it's not a shared variable, but it's still a global variable (by virtue of being mentioned in sub XXX), and later declared as lexical in the main program. @queue is also global, although used only in XXX.
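        The cloning behaviour is easy to demonstrate in a few lines. This sketch (a made-up example, not taken from the OP's code) shows a package global being cloned at thread-creation time, so a push inside the thread never reaches the main thread's copy:

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;

        # Each new thread gets its own CLONED copy of every package global.
        our @links = ('seed');

        my $seen_in_thread = threads->create( sub {
            push @links, 'found-in-thread';   # touches only this thread's clone
            return scalar @links;             # the clone now holds 2 entries
        } )->join;

        # The main thread's copy never changed: the child's push is lost.
        print "in thread: $seen_in_thread, in main: ", scalar @links, "\n";
        ```

        This is exactly why the links gathered inside the pool's worker threads never show up anywhere useful.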

        If I ran my subroutines as external modules, wouldn't that make the global variable issue a non-issue? Doesn't each call from the top level live in its own thread?
      1) I saw the module and briefly read about it but have not tried to use it. I suppose it might work, but I took my route as I figured it would be more robust ( with multithreading - ha! ) once I scale up to 10+ spiders each accessing a database.

      2) Will do
Re: Threads and LWP::UserAgent question
by BrowserUk (Patriarch) on Jun 09, 2008 at 05:53 UTC
    am I doing something mind-numbingly stupid

    Unfortunately yes. You made the mistake of using a CPAN module, Thread::Pool, that should have been strangled at birth. Try contacting the author/maintainer for support.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Threads and LWP::UserAgent question
by misc (Friar) on Jun 09, 2008 at 13:12 UTC
    Endangering myself getting strangled..

    Some time ago I wrote my own threadpool implementation..
    I was curious whether my implementation has the same problems..
    I cannot tell why, but the version below works as expected,
    while some mysterious things (I also didn't look into the sources) seem to happen in Thread::Pool.

    I also added a small delay between the requests.

    michael

      Your module certainly contains less puerile cruft than Thread::Pool does (optimise => cpu/memory, which does nothing and never has; the whole Thread::Conveyor 'abstraction too far'; and much more).

      Minus all that cruft, your module stands at least a fighting chance of being debuggable (Ie. maintainable).

      It does however exhibit a few um...peculiarities. Like

      while ( 1 ) {
          {
              lock ${ $self->{workingthreads} };
              my $pending;
              {
                  lock ${ $self->{poolcount} };
                  $pending = ${ $self->{poolcount} };
              }
              if ( ${ $self->{workingthreads} } + $pending > 0 ) {
                  cond_wait ${ $self->{workingthreads} };
              }
              else {
                  return;
              }
          }
      }

      I assume that the inner anonymous block is intended to restrict the scope of the lock statement, but given that there is nothing outside of that anonymous block within the encompassing while block, it is entirely superfluous.

      In new(), you have ${$self->{poolcount}} = 0;. At other points in the code you have lock ${$self->{poolcount}}; and ${$self->{poolcount}} --; and cond_signal ${$self->{poolcount}}; etc.

      The elements of hashes are scalars. Why are you autovivifying scalar references into those scalars? As best as I can tell, you're just complicating the code and making it run more slowly.

      These and similar constructs are rife throughout the module. You also appear to be doing far more locking than is necessary, but that's instinct rather than careful analysis talking.

      But the main problem with your module is that (on the basis of inspection at least) it appears to have one major, fundamental flaw in common with Thread::Pool: it completely misses the point of "pooling threads". iThreads are relatively expensive to both start and destroy--a side effect of emulating forking. Badly.

      The benefit of using a "pool of iThreads" is that you avoid those overheads by starting long-running threads that loop over a shared source of 'work tokens' (usually by polling a Thread::Queue), 're-using' existing threads to perform repetitive tasks.

      As best I can tell without having run your code, in common with Thread::Pool, you are discarding old threads and starting new ones for each repetition of a basic task. Like disposable nappies, this may be convenient, but it is wasteful of resources.

      In addition, there are immense problems with the implementation of the OP's code--by the OP, and, via cut&paste, in your "test code". These mostly revolve around--as correctly identified by moritz--the use of global and package-global variables, which appear to allow the main thread to share the links scavenged, but in reality result in a bunch of thread-uniquely-accessible lists which no other thread can ever access, and which get discarded when you throw away the originating thread.

      In essence, as coded, all of the work done to extract the lists of links in the threads is entirely wasted, because they are deposited into a CLONED, non-shared, thread-specific piece of storage that gets discarded as soon as the work has been done to build it.
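      The cure is to make the results array genuinely shared. A minimal sketch (the thread bodies here are stand-ins for the real link extraction):

      ```perl
      #!/usr/bin/perl
      use strict;
      use warnings;
      use threads;
      use threads::shared;

      # Declare the results array :shared so every thread pushes into the
      # SAME storage, and protect concurrent pushes with lock().
      my @links :shared;

      my @t = map {
          my $n = $_;
          threads->create( sub {
              lock @links;
              push @links, "link-from-thread-$n";
          } );
      } 1 .. 3;
      $_->join for @t;

      print "collected ", scalar @links, " links\n";
      ```

      Without the :shared attribute, each thread would push into its own clone and the main thread would see an empty array after the joins.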

      Sorry to be negative, but as it stands, your 400 line module could be better replaced with 15 lines of direct threads code. Better, because the result would be a program that worked, rather than just ran.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Yes, there are probably too many locks.

        When I wrote this module some time ago I had to struggle with deadlocks.
        So I ended up locking all variables, always in the same order,
        also with the idea that it would be easier to insert new functions into the code.

        When the whole thing worked, I didn't bother to clean up..

        At the place you mentioned: Do you mean cond_wait would also unlock ${$self->{poolcount}}, or that locking ${$self->{poolcount}} is simply unnecessary?
        I didn't try very hard to keep the number of locks down; would you say that a lock is a big penalty?

        I used the hash $self everywhere because I had in mind being able to have more than one threadpool in a script.

        You are, however, wrong about the creation of the threads:
        the module doesn't destroy any threads at all.
        Instead it creates up to maxthread threads and puts the work received with enqueue() onto a queue.
        The threads pull the work off the queue in T_thread().

        I wrote this module, because I had some problems with growing memory of long running scripts, due to creating and destroying threads.

        The OP's code..
        The only thing I've changed has been the small delay (select undef,undef,undef,0.2;).
        Since the OP didn't say exactly what he's trying to do and where all the data should go, I didn't take a closer look; instead I really did copy'n'paste.

        I've just been curious whether I could replace Thread::Pool with my module,
        since this worked without problems I thought my implementation could be helpful.
        There's also the point that the feedback, especially yours, helps me improve my code.

        So, no, I don't feel you would be negative.
      Huh, your code works fine on my desktop PC, and doesn't work on my laptop. You should submit that to CPAN as a replacement for the current garbage.
        What do you mean by "it doesn't work" on your laptop?
        Are there any error messages ?