All,
I was recently working on a web scraping project. Everything was working fine until I started to intermittently get "Out of memory!" errors or "This program has unexpectedly terminated" (a Windows error). I check the process table and the program isn't even using 1 MB of memory. I run it again and this time it runs fine. WTF I say to myself.

I think perhaps since this is fetching data from dynamically generated web pages, I am intermittently getting some bad data. I capture all the data to a local file before processing. This way, if I encounter the problem I should be able to reproduce it and then debug it. Ok great, except the same data doesn't produce the same results. Getting desperate, I decide to upgrade Perl from 5.10.0 to 5.10.1 as well as all the modules I am using. Same problem. WTF I say again.

I start sprinkling my code with a whole bunch of print statements (poor man's debugger). I determine that a particular sub is entered but is not left. Ok, so what is going on in this sub - very short, looks straightforward and then I spot it:

my @clean = map clean_html($_), @dirty;

I look at clean_html() and it too looks harmless at first:

sub clean_html {
    my ($html) = @_;
    state $hs = HTML::Strip->new();
    my $clean = $hs->parse($html);
    $clean =~ s/^\s+|\s+$//g;
    return $clean;
}

I am wondering if this is a 5.10.x bug using state so I change it to my and the problems go away. Yay, isn't everyone going to be happy I helped find a bug. Let me go gather the information on HTML::Strip so I can include it in the bug report. Wait, what's this?

HTML::Strip maintains state between calls, so you can parse a document in chunks should you wish. If one chunk ends half-way through a tag, quote, comment, or whatever; it will remember this, and expect the next call to parse to start with the remains of said tag.

If this is not going to be the case, be sure to call $hs->eof() between calls to $hs->parse().

D'oh! The fix was simple - just add $hs->eof() in clean_html() and I could re-instate my state optimization. The reason it wasn't consistently reproducible even with the same data is that I was populating @dirty from a hash and it wasn't always coming out in the same order. I spent far too much time debugging this simple problem that would not have existed if I had RTFM'd better. Any such stories you care to share?
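
For anyone who wants to see the effect without installing anything, here is a minimal sketch of what I believe was going on. ToyStrip is a made-up stand-in that only remembers whether the previous chunk ended inside a tag; it is not HTML::Strip's actual implementation, and make_cleaner(), run_all() and the sample strings are invented for the illustration. A closure over one long-lived parser plays the role of the state variable so that each demo run starts with a fresh object.

```perl
use strict;
use warnings;

# Hypothetical toy parser, NOT HTML::Strip: it drops everything between
# '<' and '>' and remembers across calls whether it is still inside a tag.
package ToyStrip;

sub new { my ($class) = @_; return bless { in_tag => 0 }, $class }

sub parse {
    my ($self, $chunk) = @_;
    my $out = '';
    for my $ch (split //, $chunk) {
        if    ($ch eq '<')                    { $self->{in_tag} = 1 }
        elsif ($ch eq '>' && $self->{in_tag}) { $self->{in_tag} = 0 }
        elsif (!$self->{in_tag})              { $out .= $ch }
    }
    return $out;
}

# Forget any half-finished tag, as the quoted docs describe for eof().
sub eof { my ($self) = @_; $self->{in_tag} = 0; return }

package main;

# Build a cleaner around one long-lived parser (standing in for the
# state variable). Pass reset => 1 to call eof() after every parse.
sub make_cleaner {
    my (%opt) = @_;
    my $hs = ToyStrip->new();
    return sub {
        my ($html) = @_;
        my $clean = $hs->parse($html);
        $hs->eof() if $opt{reset};    # the one-line fix
        $clean =~ s/^\s+|\s+$//g;
        return $clean;
    };
}

sub run_all {
    my ($cleaner, @chunks) = @_;
    return map { $cleaner->($_) } @chunks;
}

my @order_a = ('hello <i>world</i>', 'price <b');   # truncated chunk last
my @order_b = reverse @order_a;                      # truncated chunk first

my @buggy_a = run_all(make_cleaner(),           @order_a);
my @buggy_b = run_all(make_cleaner(),           @order_b);
my @fixed_b = run_all(make_cleaner(reset => 1), @order_b);

print join('|', @buggy_a), "\n";   # hello world|price
print join('|', @buggy_b), "\n";   # price|world
print join('|', @fixed_b), "\n";   # price|hello world
```

Same data, different order, different output: exactly the kind of thing a hash will hand you from run to run. With the reset in place the order no longer matters.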

Cheers - L~R

Old_Gray_Bear and toolic point out in a /msg that [doc://state] is not linking to the correct perl documentation. I am leaving the link as is since I believe this might be an issue with PerlMonks (see What shortcuts can I use for linking to other information? and How do I link to the Perl documentation?). The current working link is state.

Re: Another Reason RTFMing Is A Good Thing
by JavaFan (Canon) on Oct 19, 2009 at 15:49 UTC
    Note that you would have gotten the same problems if you wrote it without state:
    my $hs = HTML::Strip->new(); sub clean_html { ... }
    Personally, I wouldn't have (mis)used state for such an optimization (nor would I have written in the pre-5.10 style). You want to start parsing from a clean state, and as such, the code should reflect this. Your code doesn't.
      JavaFan,
      Yes, it would have produced the same unexpected results but I would have also had to move:
      my $hs = HTML::Strip->new();
      higher in the file since I prefer to list all my subs at the bottom rather than the top of my file.

      I am not sure I understand what you mean by (mis)using state. Is it not intended to ensure a variable is only initialized once? My expectation was that it was parsing from a clean state - primarily because I hadn't properly RTFMd - I didn't realize $hs was maintaining state.

      I guess what you are saying is that in the end, after I realized what was going on I should have made the code more clear. I think adding a comment would suffice:

      $hs->eof(); # ensure each call to parse() is from a clean slate

      That is in fact what I intend and what the module offers to allow it. If I have completely misunderstood you, please clarify. If you just disagree then I note your objection but do not agree with your position.

      Cheers - L~R

        I am not sure I understand what you mean by (mis)using state. Is it not intended to ensure a variable is only initialized once?
        Yes, in order to keep state. But your code obviously does not want to keep state. Just because you can call $hs->eof doesn't mean I would do that. Is HTML::Strip->new such an expensive call compared to $hs->eof that it's worthwhile to avoid calling it?
Re: Another Reason RTFMing Is A Good Thing
by derby (Abbot) on Oct 19, 2009 at 17:20 UTC

    L~R ... to go OT from your original post ... how are you downloading your data? Last time I did something like this, I found an exec of lynx -dump worked far better (quicker) than any combination of LWP and HTML::Strip. Just curious.

    -derby
      derby,
      I am using $mech->content() (WWW::Mechanize). This needs to run on Windows but I could look into alternatives.

      Cheers - L~R

Re: Another Reason RTFMing Is A Good Thing
by Bloodnok (Vicar) on Oct 19, 2009 at 17:24 UTC
    Another good reason for RTFM'ing I haven't (yet) seen on this topic is, of course, to test the docs - by way of both accuracy and (in the interests of internationalisation) more importantly, understandability.

    A user level that continues to overstate my experience :-))
Re: Another Reason RTFMing Is A Good Thing
by stonecolddevin (Parson) on Oct 19, 2009 at 19:02 UTC

    Also awesome for scraping is Web::Scraper in case you haven't heard of it.

    mtfnpy

Re: Another Reason RTFMing Is A Good Thing
by Herkum (Parson) on Oct 21, 2009 at 22:10 UTC

    I was using XML::Twig to parse a document and was using flush to get rid of pieces that I had parsed. However, I noticed that the web page where I was displaying my results was coming out really screwed up.

    Twig's flush not only deletes the element from the tree but also prints it out. I did not read this part of the documentation when I was working on the code. What I should have done (and ended up doing) was use purge, which does the same thing except it does not print out the element.
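
A rough sketch of the difference, assuming XML::Twig is installed. The XML snippet, the item element name and the @seen array are made up for the illustration; only twig_handlers, text, purge and flush come from the module.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample document for the illustration.
my $xml = '<list><item>one</item><item>two</item></list>';
my @seen;

my $twig = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ($t, $item) = @_;
            push @seen, $item->text;   # use the element while it still exists
            $t->purge;                 # free what has been parsed, print nothing
            # $t->flush;               # would also free it, but prints it to STDOUT
        },
    },
);

$twig->parse($xml);
print scalar(@seen), " items handled\n";
```

Swapping purge for the commented-out flush is the version that sends the parsed XML to STDOUT, which in a CGI-style script ends up in the page.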