in reply to Are there any memory-efficient web scrapers?

Reading between the lines of your post and making a few assumptions, I would go for a different architecture than the one you describe.

Rather than having multiple, all-in-one fetch-parse-store processes, I'd split the concerns into three processes.

  1. A fetch-and-store-to-files-in-a-known-directory process.

    Unless you need the extras that Mechanize gives you, I'd use the (much) lighter LWP::Simple::getstore() for this. One instance per thread, two or three threads per core, all feeding off a common Thread::Queue, can easily saturate the bandwidth of most connections within (say) 100 MB of memory. (A minimal sketch of this follows the list.)

  2. A single script that monitors the inbound file directory

    would spawn as many concurrent copies of ...

  3. A simple, standalone parse-a-single-HTML-file-and-store-the-results process.

    ... as are either a) required to keep up with the inbound data rate; or b) possible within the memory and processor limits of the box.

    The monitor script could be based on threads and system(), or on a module like Parallel::ForkManager, depending upon your OS and preferences; a sketch using the latter appears below.
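
To make process 1 concrete, here is a minimal fetcher sketch, assuming a pool of worker threads pulling URLs from a shared Thread::Queue and storing each page with LWP::Simple::getstore(). The worker count, directory name, and file-naming scheme are illustrative placeholders, not anything prescribed above.

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::Simple qw(getstore is_success);

    my $WORKERS = 6;            # e.g. two or three threads per core
    my $OUT_DIR = 'inbound';    # the "known directory" the monitor watches
    mkdir $OUT_DIR unless -d $OUT_DIR;

    my $queue = Thread::Queue->new();

    sub fetcher {
        while ( defined( my $url = $queue->dequeue() ) ) {
            ( my $name = $url ) =~ s{[^\w.-]+}{_}g;    # crude, collision-prone naming
            my $status = getstore( $url, "$OUT_DIR/$name.html" );
            warn "Failed ($status): $url\n" unless is_success($status);
        }
    }

    my @threads = map { threads->create( \&fetcher ) } 1 .. $WORKERS;

    while ( my $url = <STDIN> ) {             # one URL per line on stdin
        chomp $url;
        $queue->enqueue($url) if length $url;
    }
    $queue->enqueue( (undef) x $WORKERS );    # one undef per worker: "no more work"
    $_->join() for @threads;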

This separation of concerns would let you vary the number of fetcher threads and the number of HTML parsers independently, matching them to the bandwidth available and to the processing power and memory limits of the box(es) this will run on, whilst keeping each of the three components simple, linear, and easy to program.
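
And a correspondingly minimal sketch of the monitor (process 2), assuming Parallel::ForkManager and a hypothetical standalone parser script named parse_one.pl; the directory names, poll interval, and worker limit are placeholders to be tuned to your box.

    use strict;
    use warnings;
    use File::Basename qw(basename);
    use Parallel::ForkManager;

    my $IN_DIR      = 'inbound';
    my $DONE_DIR    = 'done';
    my $MAX_PARSERS = 4;        # tune to the box's cores and memory

    my $pm = Parallel::ForkManager->new($MAX_PARSERS);

    while (1) {
        for my $file ( glob "$IN_DIR/*.html" ) {
            # NB: a real version would make sure the fetcher has finished writing $file
            $pm->start and next;              # parent: move on to the next file
            system( 'perl', 'parse_one.pl', $file ) == 0
                or warn "parse_one.pl failed on $file\n";
            rename $file, "$DONE_DIR/" . basename($file);
            $pm->finish;                      # child exits here
        }
        $pm->wait_all_children;
        sleep 2;                              # simple polling loop
    }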


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Are there any memory-efficient web scrapers?
by Anonymous Monk on Aug 14, 2011 at 07:02 UTC
    When I scrape certain URLs, I have to submit a form if one is found on the page. Splitting this into separate processing steps will drastically complicate that, since not only will the content have to be saved to process later, but the entire response, so that I can reuse the headers. Even then, that might break if the web server is using sessions and the session expires before I can process it.

      I'm not aware of any such scraper. I would first try to subclass WWW::Mechanize to use some event-based parser, or even regular expressions, to extract the forms from the response. To save more memory, either do the parsing directly in the :content_cb callback, or store each page to disk and then parse it from there separately, either for forms or for data.
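
      As a minimal sketch of the store-to-disk / parse-in-the-callback route, assuming plain LWP::UserAgent (the :content_file and :content_cb request options below are standard LWP; the URL and file name are placeholders, and wiring this into a WWW::Mechanize subclass is a separate exercise):

          use strict;
          use warnings;
          use LWP::UserAgent;

          my $ua  = LWP::UserAgent->new;
          my $url = 'http://example.com/big-page.html';    # placeholder

          # Option 1: stream the body straight to disk; only the headers stay in $resp.
          my $resp = $ua->get( $url, ':content_file' => 'page.html' );
          print $resp->status_line, "\n";

          # Option 2: handle each chunk as it arrives; the body is never accumulated.
          $ua->get( $url, ':content_cb' => sub {
              my ( $chunk, $response, $protocol ) = @_;
              # feed $chunk to an incremental parser here
          } );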

      The current trend within WWW::Mechanize skews somewhat towards using HTML::TreeBuilder to build a DOM, but if you have a proposal for what an API that sacrifices the content for lower memory usage would look like, I'm certainly interested, and maybe other people are as well.

      One thing I could imagine would be some kind of event-based HTML::Form parser that sits in the content callback of LWP, so that WWW::Mechanize (or whatever subclass) can extract that data no matter what happens to the content afterwards. But I'm not sure how practical that is, as the response sizes I deal with are far smaller.
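
      A very rough sketch of that idea, assuming HTML::Parser in event mode driven from LWP's :content_cb, collecting form actions and input names as the chunks stream past; the URL is a placeholder and the field handling is deliberately simplistic (no <select>, <textarea>, or nested forms):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use HTML::Parser;

          my @forms;    # each entry: { action => ..., fields => { name => value, ... } }

          my $parser = HTML::Parser->new(
              api_version => 3,
              start_h     => [
                  sub {
                      my ( $tag, $attr ) = @_;
                      if ( $tag eq 'form' ) {
                          push @forms, { action => $attr->{action}, fields => {} };
                      }
                      elsif ( $tag eq 'input' && @forms && defined $attr->{name} ) {
                          $forms[-1]{fields}{ $attr->{name} } = $attr->{value};
                      }
                  },
                  'tagname, attr',
              ],
          );

          my $ua = LWP::UserAgent->new;
          $ua->get(
              'http://example.com/form-page.html',          # placeholder
              ':content_cb' => sub { $parser->parse( $_[0] ) },
          );
          $parser->eof;

          printf "Found %d form(s)\n", scalar @forms;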

        Great suggestion. :content_cb + incremental parser sounds like a win for my situation.

      Fair enough. Though that sounds more like driving interactive sessions than "web scraping".

