in reply to Re: Are there any memory-efficient web scrapers?
in thread Are there any memory-efficient web scrapers?

I'm only requesting HTML documents, so I added a handler that aborts the download of the response content when the content type isn't text/*. I didn't think to monitor the size, though, so I'll set max_size now. Even so, I think I need to move to something that can scale better. I was hoping something already exists, but I'm up for hacking on an AnyEvent or POE solution that incrementally parses the HTML, as it comes in or from file, with HTML::Parser or XML::LibXML.
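For what it's worth, a rough sketch of how the content-type handler and max_size can be combined on a stock LWP::UserAgent. The 1 MB cap and the URL are placeholders, and the Client-Aborted / X-Died checks reflect my understanding of how LWP reports aborted transfers, so verify them against the docs for your LWP version:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # Cap how much of any response body is fetched; 1 MB is an
    # arbitrary assumption, tune it to the largest page you expect.
    $ua->max_size(1_000_000);

    # Bail out as soon as the headers show a non-text content type,
    # before any body data is downloaded.
    $ua->add_handler(
        response_header => sub {
            my ( $response, $ua, $handler ) = @_;
            my $type = $response->content_type || '';
            die "skipping non-text response ($type)\n"
                unless $type =~ m{^text/}i;
        }
    );

    my $response = $ua->get('http://www.example.com/');

    # LWP notes truncation or handler aborts in these headers.
    warn "body truncated at max_size\n"
        if ( $response->header('Client-Aborted') // '' ) eq 'max_size';
    warn "aborted by handler: " . $response->header('X-Died') . "\n"
        if $response->header('X-Died');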

Re^3: Are there any memory-efficient web scrapers?
by Anonymous Monk on Aug 13, 2011 at 20:27 UTC

    solution that incrementally parses the HTML

    How do you know this is the bottleneck?

      Bottleneck? By that I assume you are referring to processing speed. That is not my primary concern, and I made no mention of that in my question. I am concerned about memory usage when the scraped pages are parsed for forms and links.
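      Something like this is what I had in mind: an event-driven pass with HTML::Parser that never builds a document tree, so memory stays roughly flat regardless of page size. The file name and the choice to grab only href and action attributes are illustrative, not my actual code:

          use strict;
          use warnings;
          use HTML::Parser;

          my ( @links, @forms );

          # Collect links and form actions from start-tag events only;
          # nothing else from the document is kept in memory.
          my $parser = HTML::Parser->new(
              api_version => 3,
              start_h     => [
                  sub {
                      my ( $tag, $attr ) = @_;
                      push @links, $attr->{href}   if $tag eq 'a'    && $attr->{href};
                      push @forms, $attr->{action} if $tag eq 'form' && defined $attr->{action};
                  },
                  'tagname, attr',
              ],
          );

          # Parse straight from disk ...
          $parser->parse_file('page.html');

          # ... or feed chunks as they arrive from the network:
          # $parser->parse($chunk) for @chunks;
          # $parser->eof;

          print "link: $_\n" for @links;
          print "form: $_\n" for @forms;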

        Bottleneck? By that I assume you are referring to processing speed.

        No. We're talking about memory usage. How did you determine that the link-parsing portion is responsible for your 200MB process size? And that a solution that incrementally parses the HTML is the answer to reducing memory usage?
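        One cheap way to check before rearchitecting is to watch the process size around the parse call itself. A Linux-only sketch (reads VmRSS from /proc; parse_page() is a hypothetical stand-in for whatever link/form extraction you run now):

            use strict;
            use warnings;

            # Resident set size of the current process, in kB.
            sub rss_kb {
                open my $fh, '<', "/proc/$$/status" or return;
                while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)\s+kB/ }
                return;
            }

            my $before = rss_kb();
            # parse_page($html);   # hypothetical parsing step being measured
            my $after  = rss_kb();
            printf "parse step grew RSS by %d kB\n", $after - $before
                if defined $before && defined $after;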