I would suggest you don't re-invent the wheel, and instead use WWW::Mechanize to fetch and parse the HTML.

As for threads, you may or may not find them useful. Perl threads do have issues (for instance, you cannot share most objects, and it may or may not be possible to use a shared socket, depending on your code).

A relatively simple and robust way of making something like this threaded is to use Thread::Queue. Shove the starting urls or user names in a queue and have a few worker threads, each with its own WWW::Mechanize object, pop the urls from the queue, parse the information and push the results onto another Thread::Queue that can then be read by the "main" thread.
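A minimal sketch of that two-queue layout, using only core modules so it runs offline. The process_url() sub, the example.com urls and the worker count are placeholders I made up for illustration; in a real crawler each worker would create its own WWW::Mechanize object inside the thread and fetch the url there.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real work. In a crawler, each worker would
# build its own WWW::Mechanize object and call $mech->get($url) here.
sub process_url {
    my ($url) = @_;
    return "fetched $url";
}

my @urls      = map { "http://example.com/user/$_" } 1 .. 6;   # placeholder urls
my $n_workers = 3;                                             # arbitrary choice

my $work    = Thread::Queue->new;   # main thread -> workers
my $results = Thread::Queue->new;   # workers -> main thread

my @workers = map {
    threads->create(sub {
        # dequeue() blocks until an item arrives; undef is our stop marker
        while (defined(my $url = $work->dequeue)) {
            $results->enqueue(process_url($url));
        }
    });
} 1 .. $n_workers;

$work->enqueue(@urls);
$work->enqueue((undef) x $n_workers);   # one stop marker per worker
$_->join for @workers;

# Every worker has been joined, so all results are already in the queue.
my @collected;
while (defined(my $r = $results->dequeue_nb)) {
    push @collected, $r;
}
print "$_\n" for sort @collected;
```

Only plain strings cross the queues, which sidesteps the object-sharing issues mentioned above. The order in which results arrive is not guaranteed, hence the sort.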

Update: now that you've erased your original question, it's kind of hard to discuss it.

1. The big advantage of WWW::Mechanize is that it abstracts away all the cruft you don't want to think about when building web crawlers (how to robustly match HTML links, fill in forms, find images, etc). Most of those things are not too hard, but chances are very high that you'll miss corner cases (for instance, HTML attributes may be single quoted, double quoted or unquoted, and may contain unescaped < and even > characters).

In any case, using WWW::Mechanize's forms() method gives you a much nicer interface to query the form(s) on a page.
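As a rough illustration of forms(), here is a sketch that loads a small sample page from a temporary file (so it needs no network) and lists each form's method, action and field names. The sample HTML, with its deliberately mixed attribute quoting, and the @summary variable are made up for this example; it assumes WWW::Mechanize and its LWP dependencies are installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Write a sample page to a temp file; note the unquoted, single-quoted
# and double-quoted attributes.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><body>
<form name=login action='/login' method="post">
  <input type=text name=user value='guest'>
  <input type="password" name="pass">
  <input type="submit" name="go" value="Log in">
</form>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new;
$mech->get(URI::file->new_abs($path));   # a file: url instead of a real site

my @summary;
for my $form ($mech->forms) {            # forms() returns HTML::Form objects
    my @fields = map { $_->name } grep { defined $_->name } $form->inputs;
    push @summary, join ' ', $form->method, $form->action->path, join(',', @fields);
}
print "$_\n" for @summary;
```

With a real site you would pass the page's url to get() instead; the loop over forms() stays the same.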

2. If your code really doesn't need any sharing of information, you might as well use fork(). For a simpler interface you may want to check out Parallel::ForkManager.
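Something along these lines is what I have in mind with Parallel::ForkManager. It is only a sketch: the urls and the limit of 3 children are placeholders, and the child just prints what it would do instead of fetching anything.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my @urls = map { "http://example.com/user/$_" } 1 .. 6;   # placeholder urls
my $pm   = Parallel::ForkManager->new(3);                 # at most 3 children at once

# Runs in the parent each time a child is reaped.
my $finished = 0;
$pm->run_on_finish(sub {
    my ($pid, $exit_code) = @_;
    $finished++ if $exit_code == 0;
});

for my $url (@urls) {
    $pm->start and next;   # parent: start() returns the child's pid, so move on

    # Child: a separate process that shares nothing with its siblings.
    # Real code would create a WWW::Mechanize object and fetch $url here.
    print "[$$] would fetch $url\n";

    $pm->finish(0);        # child exits with status 0
}
$pm->wait_all_children;

print "$finished of ", scalar(@urls), " children finished cleanly\n";
```

Each child is a full copy of the process, so nothing has to be marked as shared, but anything a child computes is lost unless you write it somewhere (a file, a pipe, a database) or hand it back through finish().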


In reply to Re: HTTP filtering and Threads... by Joost
in thread HTTP filtering and Threads... by danett
