
All,
I was recently working on a web scraping project. Everything was working fine until I started to intermittently get "Out of memory!" errors or "This program has unexpectedly terminated" (Windows error). I check the process table and the program isn't even using 1 MB of memory. I run it again and this time it runs fine. WTF I say to myself.

I think perhaps since this is fetching data from dynamically generated web pages, I am intermittently getting some bad data. I capture all the data to a local file before processing. This way, if I encounter the problem I should be able to reproduce it and then debug it. Ok great, except the same data doesn't produce the same results. Getting desperate, I decide to upgrade Perl from 5.10.0 to 5.10.1 as well as all the modules I am using. Same problem. WTF I say again.

I start sprinkling my code with a whole bunch of print statements (poor man's debugger). I determine that a particular sub is entered but is not left. Ok, so what is going on in this sub - very short, looks straightforward and then I spot it:

my @clean = map clean_html($_), @dirty;

I look at clean_html() and it too looks harmless at first:

sub clean_html {
    my ($html) = @_;
    state $hs = HTML::Strip->new();
    my $clean = $hs->parse($html);
    $clean =~ s/^\s+|\s+$//g;
    return $clean;
}

I am wondering if this is a 5.10.x bug using state so I change it to my and the problems go away. Yay, isn't everyone going to be happy I helped find a bug. Let me go gather the information on HTML::Strip so I can include it in the bug report. Wait, what's this?

HTML::Strip maintains state between calls, so you can parse a document in chunks should you wish. If one chunk ends half-way through a tag, quote, comment, or whatever; it will remember this, and expect the next call to parse to start with the remains of said tag.

If this is not going to be the case, be sure to call $hs->eof() between calls to $hs->parse().
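The chunked behaviour described in that excerpt can be sketched roughly as follows. This is only a minimal illustration, assuming HTML::Strip is installed from CPAN; the sample strings are made up, and the exact whitespace in the output may differ.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# The first chunk ends in the middle of an <a> tag, inside a quoted attribute.
my $first = $hs->parse('Hello <a href="page');

# The parser remembers it is still inside the tag, so the rest of the
# attribute is treated as markup rather than as text.
my $second = $hs->parse('.html">world</a>');

# Reset the parser before feeding it an unrelated document.
$hs->eof();

print "[$first] [$second]\n";
```

If the second string had been an unrelated document instead of the continuation, its beginning would have been swallowed as the tail of the unfinished tag.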

D'oh! The fix was simple - just add $hs->eof() in clean_html() and I could reinstate my state optimization. The reason it wasn't consistently reproducible even with the same data is because I was populating @dirty from a hash and it wasn't always coming out in the same order. I spent far too much time debugging this simple problem that would not have existed if I had RTFM'd better. Any such stories you care to share?
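For anyone who wants to see it spelled out, the repaired sub might look something like the sketch below. The %page hash and its contents are hypothetical stand-ins for my real data, and sorting the keys is just one way to get the same processing order on every run while debugging.

```perl
use strict;
use warnings;
use feature 'state';
use HTML::Strip;

sub clean_html {
    my ($html) = @_;
    state $hs = HTML::Strip->new();   # built once, reused across calls
    my $clean = $hs->parse($html);
    $hs->eof();                       # reset parser state so the next call starts fresh
    $clean =~ s/^\s+|\s+$//g;         # trim leading and trailing whitespace
    return $clean;
}

# Hypothetical data: the first fragment ends half-way through a tag.
my %page = (
    a => 'first <b',
    b => '<p>second</p>',
);

# Sorted keys give a repeatable order, unlike plain keys %page.
my @dirty = map { $page{$_} } sort keys %page;
my @clean = map { clean_html($_) } @dirty;

print join('|', @clean), "\n";
```

Without the eof() call, the unfinished tag at the end of the first fragment would carry over into the second one.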

Cheers - L~R

Old_Gray_Bear and toolic point out in a /msg that [doc://state] is not linking to the correct perl documentation. I am leaving the link as is since I believe this might be an issue with PerlMonks (see What shortcuts can I use for linking to other information? and How do I link to the Perl documentation?). The current working link is state.

In reply to Another Reason RTFMing Is A Good Thing by Limbic~Region
