Re^2: Are there any memory-efficient web scrapers?

When I scrape certain URLs, I have to submit a form if one is found on the page. Splitting the work into separate processing steps would drastically complicate that, since I would have to save not only the content for later processing but the entire response, so that I can reuse the headers. Even then, it might break if the web server uses sessions and the session expires before I get to process the page.

Re^3: Are there any memory-efficient web scrapers? by Corion (Patriarch) on Aug 14, 2011 at 07:25 UTC
I'm not aware of any such scraper. I would first try subclassing WWW::Mechanize to use an event-based parser, or even regular expressions, to extract the forms from the response. To save more memory, either do the parsing directly in the `:content_cb` callback, or store each page to disk and parse it from there in a separate pass, whether for forms or for data. The current trend within WWW::Mechanize skews somewhat towards using HTML::TreeBuilder to build a DOM, but if you have proposals for an API that gives up keeping the content in exchange for lower memory usage, I'm certainly interested, and other people may be as well. One thing I could imagine is some kind of event-based HTML::Form parser that sits in the content callback of LWP, so that WWW::Mechanize (or whatever subclass) can extract that data no matter what happens to the content afterwards. But I'm not sure how practical that is, as the response sizes I deal with are far smaller.
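To make the "event-based form parser sitting in the content callback" idea concrete, here is a rough sketch of the technique. It is written in Python with only the standard library so that it runs without a network connection, and it is not the LWP or WWW::Mechanize API. The names `FormCollector` and `extract_forms` are made up for illustration, and the list of chunks stands in for whatever the content callback would receive. In Perl the same shape would presumably be an HTML::Parser object fed from `:content_cb`.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collects <form> elements and their named inputs; keeps nothing else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.forms = []       # one dict per <form> seen so far
        self._current = None  # the form that is currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(chunks):
    """Feed body chunks to the parser as they arrive; never join them."""
    parser = FormCollector()
    for chunk in chunks:
        parser.feed(chunk)  # the chunk can be discarded after this call
    parser.close()
    return parser.forms


# Simulated response body, split mid-tag to mimic arbitrary network chunking.
chunks = [
    "<html><body><p>lots of text</p><form act",
    'ion="/login" method="POST"><input name="user" val',
    'ue="bob"><input type="hidden" name="token" value="abc">',
    "</form><p>more text</p></body></html>",
]
forms = extract_forms(chunks)
print(forms)
```

The parser only buffers an incomplete tag between calls, so memory use is bounded by the form data plus one partial chunk rather than by the page size. Only `<input>` fields are handled here; `<select>`, `<textarea>` and friends are left out to keep the sketch short.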
Re^4: Are there any memory-efficient web scrapers? by Anonymous Monk on Aug 14, 2011 at 07:39 UTC
Great suggestion. `:content_cb` plus an incremental parser sounds like a win for my situation.
Re^3: Are there any memory-efficient web scrapers? by BrowserUk (Patriarch) on Aug 14, 2011 at 07:29 UTC
Fair enough. Though that sounds more like driving interactive sessions than "web scraping".
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.